NESUG 2007 Applications Big and Small

SAS®, ECONOMETRICS, AND COMPUTATION SPEED
MICHAEL D. BOLDIN, UNIVERSITY OF PENNSYLVANIA, PHILADELPHIA, PA

ABSTRACT

This paper discusses how researchers can take advantage of SAS and the ever-growing power of personal computers (PCs). Due to previously unimaginable processing speeds and the overall capabilities of relatively inexpensive PCs, many procedures that were once considered computationally prohibitive are now within the reach of almost anyone who wants to use a preferred econometric technique, and SAS is likely to have the desired capabilities. Still, there are cases where forcing any statistical package such as SAS to perform a task for which it is not well suited is not a good choice, simply because other easy-to-use econometric tools are readily available.

INTRODUCTION

Modern personal computers (PCs) have processing and storage capabilities that make the first few vintages of this technology seem toy-like in retrospect. In fact, inexpensive PCs now rival the power that only a few years ago could be found only in much larger and more expensive systems. Therefore, empirical researchers in various fields, including all areas of the social and physical sciences, and specialties such as pharmaceuticals and financial economics, have many options for data management and statistical analysis. This paper discusses the extent to which SAS takes advantage of new technology and remains one of the more user-friendly options for general econometrics.

Because the speed with which data can be manipulated and computations performed has increased over time, many procedures that were once considered computationally prohibitive are now well within the reach of almost anyone who wants to use a particular econometric technique. Furthermore, extremely large databases can be stored in SAS format on a single-user system at a relatively low cost, or just as easily and cheaply, the same data can be made available to many users through either a local network or the internet. In fact, due to less costly and more powerful computing technology, packages such as SAS no longer carry large computational handicaps when compared to more general computing languages (such as FORTRAN and C/C++). In other words, for most econometric applications, the speed of a SAS program is effectively no different from the speed of a custom-written FORTRAN or C program.

At the same time, the technical expertise that was once required to write custom programs in other computer languages has declined considerably. Not only have programming interfaces become more user friendly, many of the alternatives to SAS are fairly standardized (among themselves), and their differences from the SAS programming environment are widening. Therefore, SAS users may feel (rightly or wrongly) that they cannot take advantage of new developments in computer languages and data manipulation tools, especially object-oriented programming concepts and the open-source software movement.

In discussing these issues and the advantages of SAS, I consider how advances in computer technology affect the capabilities and the use of any econometric software package. I also try to place SAS in a perspective that recognizes why it became popular as an alternative to traditional programming and how it compares to other options on a forward-looking basis. This paper is not a review or ranking of statistical packages in the conventional sense, however. The main goal is to present a framework for making choices across types of software packages, such as the decision to use a general programming language for econometrics, a mature statistical package such as SAS, or an alternative such as MATLAB. The MATLAB case is the most interesting because it has gained popularity among econometricians who need to employ non-standard econometric techniques or prefer to use custom-written procedures. While SAS has a module (PROC IML) with similarities to the MATLAB environment, its use seems to be very low even among dedicated SAS users. The general conclusion of this analysis is that user friendliness and personal preferences will remain the overriding factors, such that different choices by researchers with different backgrounds are completely reasonable.
Still, there are cases where forcing a program or statistical package to perform a task for which it is not well suited is not a good choice, and a comparison of the different programming options can help to determine the best choice. In the final analysis, technological progress on both the hardware and software sides allows researchers to use the tools that they find most comfortable; they are not forced into a one-solution-fits-all world. Moving forward, SAS should continue to be used by serious econometricians with diverse interests and needs, and also by more casual empirical researchers in many different fields.


ECONOMETRIC PACKAGE CATEGORIES

It is not possible to evaluate all of the important issues for all econometric packages that could be chosen. Various econometric software lists on the internet have 60 to 100 entries (although these lists tend to double-count versions of the same package and include defunct packages), and there are at least 30 packages that could be considered substitutes for the most used features of SAS. It is not even feasible to compare all aspects of a few popular packages. It is possible, however, to identify three fairly distinct categories of computer languages or software packages for data analysis and applied econometrics:

1. Traditional programming languages (examples: Basic, C/C++, FORTRAN, and Java).
2. Statistical packages (examples: EVIEWS, SAS, Stata, and TSP).
3. Matrix algebra oriented and other mathematical language tools (examples: GAUSS, Mathematica, MATLAB, and Splus).

For the taxonomy above, a distinction can be made between the package and any user-supplied programs. Here, the term 'package' is used to denote any self-contained software application, such as SAS. The package concept includes any module that is typically part of a basic or default installation, or is popular with econometricians, such as SAS's statistics (STAT) and econometrics (ETS) modules. Similarly, a FORTRAN package would include both the compiler and the 'libraries' of functions that are most commonly used with FORTRAN (such as LINPACK and GQOPT). The term 'program' is reserved for user-written applications that may or may not use the package's library of functions. In this context, 'code', 'script', and 'program' are all synonyms for the user-written program files. In some cases, however, user-written code for data manipulation and analysis has evolved into routines and libraries that are either commonly shared or sold as add-on packages (and in some cases in more than one package, such as the IMSL routines), so the distinction between the package and user-written code is not always sharp.

All of the packages listed above make use of high-level language concepts that abstract from the low-level assembly and machine languages that give computers their actual instructions. High-level languages use English words such as IF, THEN, ELSE, DO, FOR, WHILE, OUTPUT and PRINT to represent logical choices, steps, and specific actions, which are then translated into assembly and machine language commands by intermediary programs known as 'compilers', which translate the entire program before executing its instructions, and 'interpreters', which translate and execute programs in stages (often line-by-line). The alternative is to write assembly or machine language commands directly, which would require detailed knowledge of the computer's structure and memory registers and would make the estimation of even simple econometric equations almost impossible for anyone who is not a dedicated computer scientist.

FORTRAN and C/C++ are clearly the premier cases for the first category. FORTRAN was invented in the mid-1950s, and some call it the 'language that refuses to die.' Most important, it is still known for its speed, mainly because computer scientists continue to work on optimizing FORTRAN compilers, and not because the language's original design was so strong. In fact, FORTRAN syntax is often cursed by new users because it is difficult to learn and master.
C/C++ is noteworthy because of its embrace of the object-oriented programming paradigm, and most of the packages in the second and third categories were written in this language. This suggests, but does not necessarily prove, that C/C++ is the most powerful and flexible programming tool available.

SAS is a good representative for the second group because it is the most popular and mature package in this category. SAS can be readily used by almost anyone to manage data, estimate models, and run statistical tests without any formal training in computer programming. The basic structure and conceptual framework of SAS programs has not changed greatly since it was introduced commercially in the 1970s, but the number of systems on which it can be used, its user interface, and its power in managing databases have grown considerably, and for the better. Currently SAS has at least ten serious competitors, with Stata probably having the second largest group of users in this category. Still, SAS's user base dwarfs all of its competitors in both the commercial and academic arenas.

I consider MATLAB to be the most popular and best representative of the third group for econometric analysis, but not all would agree with this assessment. MATLAB was introduced in the 1980s (as was its main commercial competitor, GAUSS) as a way to allow non-programmers to write code that uses concise, matrix algebra based formulas. Irrespective of MATLAB's role as a leader in the group, this category seems to be expanding faster than the other categories, both in the number of serious packages and serious users.


Both SAS and MATLAB have an enthusiastic set of users, active Internet newsgroups, and large archives of user-written programs. On the cost side, SAS and MATLAB are the most expensive in their categories (especially for commercial users, with a much lower licensing cost differential for academic/educational use). Admittedly, these two packages are not as popular among time series econometricians as other packages (such as RATS, TSP and EVIEWS). Still, SAS and MATLAB are much stronger for time series econometrics than the other packages are at managing data and estimating either pure cross-section or panel-type models. Both can be used to write code for newly developed time-series procedures, and both continue to improve their offerings to time-series econometricians.

It is also interesting to note that the companies that provide SAS and MATLAB are similar in many respects. Both started as 'academic' projects but have since gone commercial and private, and each reports exponential growth in employees over the past 10 years. Such employee growth is probably the best metric of their success and standing among their competition. With large employee bases, both provide high levels of support, and after their early inceptions, neither has favored any particular operating system (such as MS-Windows or UNIX) over another. These two packages seem most likely to remain innovative and popular, even though they use completely different programming paradigms.

A fourth category for 'scripting' languages such as Perl and Python that make use of specially developed statistical packages could also be considered, but this grouping would overlap with the first and third categories in many respects. For example, both Perl and Python can 'call' C and FORTRAN subroutines, as can many of the packages in the third category. Most noteworthy in this regard is the fact that the NumPy module in Python can be used to write code that resembles the manner in which arrays are used in GAUSS and MATLAB. Also, both Python and Perl have modules that effectively embed R as an intrinsic library. Thus, languages such as Perl and Python can be treated as examples of the third category.

Another way to look at the three categories is that they represent packages that became popular and 'mature' almost sequentially. There are a few exceptions to this view, such as Java, and more important, it is not fair to treat the older packages as inferior simply because they are not modern. A fair evaluation of the packages in the three groups should consider: (1) how easy it is to write programs in their high-level language, which is basically a normative evaluation of the intuitive nature and logic of the language's syntax rules; (2) debugging ease, including the difficulty of writing valid program code and the potential for making and not finding or correcting mistakes; (3) the performance of programs in terms of speed and computer memory conservation; (4) interfaces to other programming tools; and (5) database management issues.

REVIEW OF THE REVIEWS

There have been countless reviews and comparisons of statistical and econometric software packages. Most of these articles are not comprehensive in covering the plethora of choices, however, and there are other problems with the reviews as well. Rycroft (1999) provides a fairly extensive list, showing 51 packages from 32 companies, along with a comparison of features, but the latter is extremely dated. Renfro (2004) counts over 30 standalone packages in an effort to document the state of the art of econometric software as of 2003, but inexplicably does not include SAS or MATLAB (even though Stata and GAUSS are included). The majority of the other reviews cover only one or two packages, and some are not much deeper than a list of the features that are likely to be of interest to an econometrician. Although the best reviews explain the main strengths and weaknesses of each reviewed package, almost none conclude that one package is clearly superior to another.

Two noteworthy reviews that are somewhat different are a review of econometric software for the American Economic Association's Journal of Economic Perspectives by Jeffrey MacKie-Mason (1992) and a series of web-published reports by Stefan Steinhaus (dated 1997, 1999, and 2002). MacKie-Mason attempts to help 'practising economists decide which [econometric software] tool will best get the task done.' In the end, however, MacKie-Mason could not select a single package as the unqualified winner or most useful tool for all users among GAUSS, Limdep, RATS, SAS, SST, Stata, and TSP. He did conclude that TSP was the best overall package, save for data handling, but qualified this selection by stating that SAS had the most comprehensive set of statistical procedures and was unchallenged in its data importing and database creation tools. SAS was deemed to have one major shortcoming: 'it can be overwhelming for occasional users'. He also stated that GAUSS was a good choice for those who enjoy programming, and that Stata was probably the easiest to learn of the seven packages.

Although MacKie-Mason's review is thoughtful and contains good advice, it is dated in terms of the versions of the packages he examined, simply because of the fast pace of change and improvement among the packages. Most of the concerns or problems that he noted about the available statistical procedures and tests have long been corrected, as have deficiencies in documentation and help facilities.


In addition, the data handling methods of each package have been completely revamped in almost all cases. Finally, it is no stretch to believe that simply adding the current user interfaces to any of the packages that he studied would have been enough to tip the scales far in favor of that package. Nonetheless, MacKie-Mason's final advice and conclusions are still consistent with general impressions about these packages. For example, the matrix-oriented features of GAUSS were seen as powerful and innovative in the early 1990s, and although GAUSS is not as unique and innovative by today's standards, this paradigm has proven itself to be extremely useful to econometricians.

In the other noteworthy comparison, Steinhaus scored packages that are primarily in the third category in terms of their functionality, using the following weights: mathematical functions 38%, graphical functions 10%, programming environment 9%, data import/export 5%, available operating systems 2%, and speed comparison 36%. In the 2002 ranking, MATLAB received the overall highest score, 69 out of 100 points, followed by GAUSS, Ox, and Mathematica. (SAS was not considered.) In the 2002 speed tests, the highest ratings went to O-Matrix, Ox, MATLAB, and GAUSS, in that order, but the speed scores are close enough to be considered operationally equivalent. Problems with the speed comparisons are discussed below, and even in terms of overall scores, it is hard to consider the evaluations as completely objective or applicable to all potential users of the packages.

A DIGRESSION ON COMPUTER SPEED

Even casual users of computers are well aware of the extremely fast computing speeds and other advances that are now available on low-priced personal computers. For example, the original IBM PC model sold in 1981 at today's price of about $5,500, but had no hard disk, a low-resolution monochrome monitor, no networking capabilities, and what many now consider a flawed chip design and operating system. Today, one can buy a PC for under $1,500 that has 3,000 times faster computing speeds, a 100 GB hard drive, a CD reader and burner, a vibrant color monitor, and built-in networking options. One would certainly expect such advances to affect how computers are used for econometric purposes. Figure 1 puts the computing speed advances of the past 12 years into perspective. The table records Whetstone scores, an industry benchmark of a chip's ability to perform floating point calculations, for what were state-of-the-art PC chips at different points in time.

Fig. 1.0 PC CHIP SPEEDS AND PERFORMANCE

   CPU                  Clockspeed   Year   MWIPS   Time index
   Intel Pentium I      120 MHz      1995      79       100.0
   Intel Pentium II     266 MHz      1997     218        36.2
   Intel Pentium III    550 MHz      1999     448        17.6
   Intel Pentium IV     1.8 GHz      2001     638        12.4
   Intel Pentium IV     3.6 GHz      2003    1342         5.9
   AMD Opteron          2.0 GHz      2003    1580         5.0
   AMD Athlon 64        2.1 GHz      2003    1720         4.6
   Intel Core 2 Duo *   2.4 GHz      2006    2057         3.8

MWIPS = Millions of Whetstone Instructions per Second. A higher MWIPS score is better (i.e., a faster chip), and a twice-as-high MWIPS translates into roughly 50% less time to make an average numerical calculation. Source: http://homepage.virgin.net/roy.longbottom/whetstone.htm
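The Time index column is simply the MWIPS column rescaled relative to the 1995 chip: Time index = 100 x (79 / MWIPS). For example, the Core 2 Duo's 2057 MWIPS corresponds to 100 x 79/2057, or roughly 3.8, meaning a fixed set of calculations takes about 3.8% of the time it took on the 1995 Pentium I.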

These scores show that new chips can compute mathematical functions 8 to 18 times faster than chips that were popular in the late 1990s. And it is likely that computation speeds will continue to double at least every 18 to 24 months (based on an implication of Moore's law, which predicts that the number of transistors per square inch on integrated circuits doubles every year), with dual- and multi-core processors likely to accelerate the pace even further.


While the MWIPS column in Figure 1 helps to gauge the increase in processor speed, the Time index is a better way to look at the issue. This measure inverts the number of calculations per second to show the time required for a fixed number of calculations. While there may be no limit to a statistic such as MWIPS, there is a limit to the time improvement for a particular set of calculations. In essence, it is more relevant to know the time needed for a fixed number of statistical calculations, which corresponds to the Time index above, than to know the number of calculations that can be computed in a fixed period of time. Most important, the time for a fixed set of calculations cannot break through the 'zero' point (it is bounded by zero).

While one might be impressed to hear that system A is four times faster than system B, one should consider absolute speed and the diminishing returns to relative speed. The ability to lower an 8-minute calculation by a factor of four to 2 minutes is much more valuable than lowering the calculation time by another factor of four to 30 seconds. Lowering a calculation from 1.0 second to 0.25 seconds is the more realistic case today, and it only yields an absolute speed increase of 0.75 seconds, which is not likely to be noticeable or worthwhile to most users if doing so is time consuming and has other set-up costs. In essence, relative speed increases can be very misleading, and the speed issue needs to be part of a more complete and broader perspective for evaluating econometric software.

To put the computer technology and speed issue in another perspective, consider the series of speed tests for mathematical/statistical packages reported by Stefan Steinhaus (discussed above). In any of these evaluations, running one of the four packages GAUSS, MATLAB, O-Matrix, or Ox on a PC that was just two times faster than the PC used for the other packages would put that package at the top, both in the speed tests and in the overall score. While this change in the benchmark PC might seem unfair, it is often a choice that a user of a particular econometric package can make: instead of investing the time required to learn a new and supposedly faster package, it may make more sense to invest in a new and clearly faster PC.

To put specific numbers behind this point, in a 1999 speed test Steinhaus used MATLAB (version 5.3) running on a Pentium-II 400 MHz Windows PC to invert a 1000x1000 matrix in 44.0 seconds. For the 2002 test, the same-size matrix was inverted in 7.9 seconds (using MATLAB version 6.3) on a Pentium-III 550 MHz Windows PC. I repeated the same exercise in 4.6 seconds (using MATLAB version 7.0) on a Pentium M 1.7 GHz Windows laptop, and in 2.0 seconds on a Pentium IV 3.0 GHz Windows PC. Comparing the 1999 and 2002 tests shows a speed increase of 600%, and comparing the 2002 tests to my tests shows a speed increase of between 100% and 300%. All of these increases in speed greatly exceed the difference between the fastest and slowest packages in Steinhaus's tests in any year. The main conclusion is that the progress of PC technology can make almost any package run much faster on a new PC than on an older and previously 'fast' PC. Thus, speed benchmarks, as they are normally conducted on a single PC, can be misleading.
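For readers who want to repeat this simple matrix-inversion benchmark on their own hardware, a minimal sketch using SAS's PROC IML might look like the following. The matrix size and seed are arbitrary choices, and this is only one way to set up the timing:

   proc iml;
      n = 1000;                      /* matrix dimension                            */
      x = normal(j(n, n, 1));        /* n x n matrix of pseudorandom normal draws   */
      t0 = time();                   /* time of day, in seconds                     */
      xinv = inv(x);                 /* invert the matrix                           */
      elapsed = time() - t0;         /* elapsed wall-clock time for the inversion   */
      print elapsed;
   quit;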

AN ALTERNATIVE FRAMEWORK FOR EVALUATION

This section makes use of a stylized model that represents the elements that are most relevant in deciding the best package to use for a particular econometric task, and it puts the speed issue in better perspective. Although the 'model' is extremely abstract, it is useful because it considers in concrete terms how 'computation speed' may or may not matter.

Assume an econometric software user desires to minimize the combined computation time and programming effort, where the driving factor for each element is a complexity metric, x. This metric can be thought of as the number of computations needed to complete the task, such that the higher the complexity number, the longer the computation time and, in most cases, the greater the programming effort. Furthermore, assume that each package has its own functional parameters for how it handles complexity (i.e., translates complexity into computation time and programming effort) and that these functions are quadratic:

   Computation Time   = (a0 + a1*x + a2*x^2) / speed
   Programming Effort = (b0 + b1*x + b2*x^2) * cost

The parameters a0 and b0 represent set-up time, while a1 and b1 are the linear coefficients and a2 and b2 are the quadratic coefficients. All coefficients are assumed to be non-negative, with the linear coefficients (a1 and b1) generally the most important terms, although high quadratic terms (a2 and b2) are also realistic for certain cases. The 'speed' and 'cost' terms are common parameters that affect the functions for all packages in the same manner, where the latter translates programming effort into a time-comparable cost.
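To make the framework concrete, the following SAS DATA step sketch evaluates the two cost functions over a complexity grid and flags the lower-cost package at each level. The coefficient values here are purely hypothetical choices for illustration, not the parameterization behind the charts below:

   data costs;
      speed = 1;  cost = 1;                            /* common scaling parameters     */
      do x = 0 to 10;                                  /* x = task complexity           */
         time1   = (0.5 + 0.5*x + 0.05*x**2) / speed;  /* Package 1: slower computation */
         time2   = (0.1 + 0.1*x)             / speed;  /* Package 2: faster computation */
         effort1 = (1.0 + 2.0*x) * cost;               /* Package 1: easier to program  */
         effort2 = (4.0 + 1.8*x) * cost;               /* Package 2: harder to program  */
         total1  = time1 + effort1;
         total2  = time2 + effort2;
         better  = ifc(total1 <= total2, 'Package 1', 'Package 2');
         output;
      end;
   run;

   proc print data=costs;
      var x total1 total2 better;
   run;

With these particular values the total-cost advantage switches from Package 1 to Package 2 at a complexity level between 3 and 4, which is the kind of crossover the charts below are meant to illustrate.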


Below is a graphical representation of these functions (for particular parameterizations), where Package 2 is assumed to be much faster than Package 1, and only Package 1 is assumed to have a non-zero quadratic coefficient (although a relatively small one in this case). The first panel in the chart shows a case where Package 2 would be selected irrespective of the simplicity or complexity of the task if computation time were all that mattered. (In this case Package 2 is assumed to have a greater than 4:1 advantage in speed, with time to compute being the inverse of speed.) The second panel of Chart 1.1 provides a User Programming Element, or effort, comparison in which it is assumed that Package 1 has an advantage for most tasks, but very complex cases are better handled with Package 2.

Chart 1.1 Comparison of Computation and Programming Time

[Three panels plotted against task complexity (0 to 10) for Package 1 (slow and easy) and Package 2 (fast and hard): the Computation Time Element, the User Programming (Time and Effort) Element, and the Differences in Costs (Total, Computation, and Programming).]

Admittedly, the parameterization for this example has arbitrary elements, but the general relationship between programming effort and computational time for the average task (complexity = 5), at roughly 2:1 for Package 1, is a reasonable baseline. The final panel combines both elements, showing that Package 2 is the better choice for any task with a complexity level above 4.

Chart 1.2 shows the same basic structure, with only two changes: an increase in speed by a factor of 10 for both packages (which is meant to represent the speed advantage of a new computer generation), and a decrease in the common programming effort cost by one-half. The net effect is to widen the complexity range over which Package 1 has the advantage from 0-4 to 0-6.


Chart 1.2 Comparison of Computation and Programming Time with Improvement in Relative Processing Speed

[Same three panels as Chart 1.1, re-plotted with a tenfold increase in processing speed for both packages and a one-half reduction in the common programming effort cost.]

Other parameterizations can be used, and many will be just as reasonable, but as long as computation speed increases faster than programming costs fall over time, a comparison of 'total costs' for a task will become largely driven by the programming effort element. Given that the slower but more user-friendly packages were popular at computing speeds and capabilities far below today's standards, it is reasonable to see the world of econometric package selection as more like Chart 1.2 than Chart 1.1. Furthermore, the range in which the easier but slower package is preferred will tend to increase with each new computer generation.

In sum, the analysis in this section can be seen as a confirmation of basic observations about the market for econometric software. There is no denying that easier-to-use and slower-than-average packages are popular, and they seem to be growing in popularity as their software developers add features and improvements, which in turn are supported by users who purchase the latest updates. This is not necessarily an obvious development, as another path could have dominated based on the fact that some users surely migrate to the faster and more flexible languages as they gain programming experience. These users probably care little about upgrades to the first econometric package that they learned to use, but it seems that the average user does care.

MORE COMPUTING SPEED TESTS

Even with reservations about the relevance of computing speed, I present my own tests, mainly to compare the speed of SAS to MATLAB and to compare both packages to FORTRAN and C/C++. In these comparisons, two distinct speed tests were devised using different packages on two different computing platforms. The first test was a Black-Scholes call option valuation exercise, where 1 million values were computed on each platform/package combination. While the need to compute such a large number of option values is unrealistic for most applications, this large number of calculations helps to accentuate the speed differences.


The second test involved combining data tables of stock returns, sorting the results, and computing a CAPM beta for each company's stock. The first set of runs used a Sun V440 system running Solaris 9.0 (UNIX) with four 1 GHz processors and 8 GB of memory. This is the type of server or minicomputer system that might be shared by 5 to 10 users who are interested in a fairly high level of performance. The second set used a basic MS-Windows PC with a 1.7 GHz Pentium 4 processor that serves as a mid-range single-user system. The results are shown in Figure 2.1; all of the programs were standardized to be as comparable as possible, and default settings were used for compiling and processing.
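For reference, a minimal sketch of the kind of SAS DATA step used for the first test follows. The input values and the random strike grid are hypothetical simplifications of the actual test program; the point is only to show that the Black-Scholes formula reduces to a few lines of built-in SAS functions:

   data calls;
      r     = 0.05;                 /* risk-free rate (hypothetical)           */
      sigma = 0.20;                 /* volatility (hypothetical)               */
      t     = 0.5;                  /* time to expiration, in years            */
      s     = 100;                  /* spot price                              */
      do i = 1 to 1000000;          /* 1 million valuations                    */
         k  = 80 + 40*ranuni(12345);                     /* random strike, 80-120 */
         d1 = (log(s/k) + (r + sigma**2/2)*t) / (sigma*sqrt(t));
         d2 = d1 - sigma*sqrt(t);
         call_value = s*probnorm(d1) - k*exp(-r*t)*probnorm(d2);
         output;
      end;
      keep k call_value;
   run;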

Fig. 2.1 BLACK-SCHOLES OPTION VALUATION FORMULA CALCULATION SPEEDS
Time in seconds, 1 million computations.

                      System A     System B
                      Sun V440     Pentium 4 PC
   C Program              3.0          1.5
   Fortran                4.1          --
   Matlab                 2.4          1.4
   SAS                    4.6          6.7
   R                      --           1.9
   EXCEL-VBasic           --         560.0

The results show that a Pentium 4 PC is not just competitive with an average multi-processor system; it can perform better in some tests. The relationships among the packages across the two systems were not fixed, however. Most important, the fact that the times for the C program, MATLAB, and R were within tenths of a second of each other shows that absolute speed is only relevant for extremely computationally intensive tasks. Times for FORTRAN were higher than for these three packages on the UNIX server system, which may be due to the compiling settings. At the least, the FORTRAN result shows that it may take some experience to get the most out of this language. The EXCEL-Visual Basic results are far and away the worst, and this example was mainly included to show that not all software selections are appropriate when speed matters. While SAS was second slowest, for the more common uses of this package, which require far fewer than a million calculations, the time performance of SAS is likely to be adequate.

The second computational test involved computing regression statistics (CAPM betas) for 500 stocks in the CRSP database, with the set of stocks selected as those that were members of the S&P 500 index at the end of December 2004. Daily data were used, and each stock was treated separately for purposes of the regression, but the data were extracted from the database and prepared as a group for the 500 regressions. The date range for each stock was limited to 10 years (1995-2004), with an additional constraint that no date range for a stock started before the company was a member of the S&P 500. This variable date range feature was intended to add a complication to the data extraction and preparation steps, and also to the regression computation, which makes this speed test more realistic than simply using the exact same date range for each stock.
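A minimal sketch of the SAS portion of this test follows; the dataset and variable names are hypothetical, and the data extraction and date-range screens are omitted. After the prepared returns data are sorted by stock identifier, a single BY-group PROC REG call estimates all 500 regressions and writes the coefficients to an output dataset:

   proc sort data=work.returns;
      by permno date;                  /* one record per stock per trading day */
   run;

   proc reg data=work.returns outest=work.betas noprint;
      by permno;                       /* one regression per stock             */
      model ret = mktret;              /* slope on mktret is the CAPM beta     */
   run;
   quit;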


The results are reported in Figure 2.2, which shows that the FORTRAN program is clearly the fastest in an initial run.

Fig. 2.2 CAPM REGRESSION CALCULATIONS
Time in seconds.

                  Run 1     Run 3     Regression portion
   C Program       19.2       3.7          < 0.5
   Fortran          3.4       3.1          < 0.5
   Matlab          27.0      23.1          < 0.5
   SAS            235.0      41.0            1.0

System: Sun V440. Times in seconds are based on custom-written programs that compute 500 regression equations with one explanatory variable and approximately 2,000 observations per regression (961,000 total observations). Total test times include data access and preparation. Run 3 times allow for potential memory caching. The '< 0.5' second designations could not be measured with more precision.

This conclusion can be somewhat misleading, however, given that the FORTRAN program needed to be compiled as a first step, which adds about 10 seconds (but is not needed for subsequent runs). The C program takes considerably longer to both compile and make an initial run, but surprisingly the same program produces basically the same time as the FORTRAN run in follow-up runs, because the data was already in the computer's memory cache. In any test of this type, it is clear that data access time overwhelms computational time, and disk reading speed (which caching in the subsequent runs often avoids) can vary by even greater margins than the results shown above. Even the slowest run time from a SAS program is clearly acceptable given the necessary programming time, which would range from 10 to 20 minutes and could easily exceed 1 hour for inexperienced programmers, even under the best circumstances. Most important, the last column in Figure 2.2 shows that the econometric calculations (the regression portion) are effectively equivalent across the four packages.

While the computation speed tests above are not fully representative of the 'average' econometric application, they are realistic examples showing that computation speed is rarely relevant. In fact, it is not easy to find realistic or common cases that show noticeable differences. For example, when computing OLS coefficients for cases with fewer than 1 million observations and 100 explanatory variables, the time differences are usually even smaller than those reported above. And due to multi-core processors that have become de rigueur on new PCs, any time differences will matter even less in the future.

MORE ON PROGRAMMING CONSIDERATIONS

Programming time and effort varied greatly for the above tests, especially the second set of tests. Evaluating the effort or experience needed to write a particular program is admittedly subjective, but counting the lines of code needed to accomplish a particular task serves as a useful proxy for ranking the user friendliness of each program or package. In fact, computer scientists often use program line counts as an approximate measure of both programming time cost and program complexity, with a general assumption that programming time and program line counts are roughly proportional. Below are the line counts for the programs used in the second test (CAPM regressions) above:

   FORTRAN   103
   C          64
   SAS        15
   Matlab     20

These numbers are one way to quantify the claim that C and FORTRAN programs can be many times harder to write than programs for packages in the second and third groups, such as SAS and MATLAB. While this is not the only way to show this, the assertion is fairly uncontroversial. More important, the differences in programming time greatly dwarf the differences in run times, by a factor of at least 100:1.


Thus, switching from SAS to FORTRAN or C is rarely a good decision. Switching from SAS to MATLAB (or another package in the second or third categories, and vice versa) might be more reasonable, but based on an assessment that considers the additional time and costs required to learn and use a new package, there are few compelling reasons for an experienced user of one package to switch.

It is also fairly easy to show examples where a poorly designed program for a particular task will run more than 10 times slower than necessary, irrespective of the raw speed and capabilities of the programming language or statistical package. For example, the computation time of the SAS and MATLAB programs used in the second test above would have increased substantially if 'loops' that emulated the FORTRAN and C program steps had been used to process the data and compute the regression statistics, instead of the most efficient steps for these two packages. Such loops probably would seem natural to a FORTRAN or C programmer who wanted to try and test a different package, but any experienced SAS or MATLAB user would know a better method. Similarly, the FORTRAN and C programs used in the second test were based on well-established and tested data access and computation algorithms, and other program designs could easily double or triple the computing time.

For a discussion of this last issue, see Prechelt (2000), which presents results from using students to build distinct implementations of the same programming problem in seven different languages (C, C++, Java, Perl, Python, Rexx and Tcl). Prechelt then analyzed program length, programming effort, runtime efficiency, memory consumption, and reliability. One of his conclusions, shared by many other computer scientists and experts in the field, is that designing and writing a program in an interpreted, scripting language (such as Perl, Python, Rexx, or Tcl) can take less than half as much time as writing it in a compiled language (such as C, C++, or Java), simply because the programs tend to be half as long or shorter. Still, Prechelt finds no clear differences in program reliability among the language groups, although script programs tend to consume about twice as much memory (as a C or C++ program) and run slower on average. His most important finding might be: "For all program aspects investigated, the performance variability that derives from differences among programmers of the same language ... is on average as large or larger than the variability found among the different languages."

In the final assessment of the programming issues, 'implementation' often matters more than 'potential' for computational speed. Program reliability and accuracy matter as well, which strongly suggests that one should not stray from the package that one knows best without extremely good reasons.
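Returning to the point above about loop-style designs, the following sketch (with hypothetical dataset and variable names) shows the slower approach: one PROC REG call per stock through a macro %DO loop. Each iteration re-reads and subsets the full dataset, so this design is far slower than the single BY-group run shown earlier, even though it produces the same estimates:

   %macro capm_by_loop(nstocks);
      %do i = 1 %to &nstocks;
         proc reg data=work.returns(where=(stockid = &i)) outest=work.beta&i noprint;
            model ret = mktret;
         run;
         quit;
      %end;
   %mend capm_by_loop;

   %capm_by_loop(500)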

THE CAPABILITIES OF SAS

Among the programming advantages of SAS is its ability to handle both small and large data sets in an equally adept manner, with the base package offering numerous data import and export options. For basic data transformations, SAS DATA steps are relatively easy to write and follow. In addition, PROC EXPAND can handle complicated data transformations, such as weighted moving averages and other data filters (such as Hodrick-Prescott), in a single line of code. SAS also has fast sorting and record-selection capabilities, plus a built-in ability to handle SQL syntax to 'join' tables or datasets.

If SAS had few of the tools and procedures that are needed by most econometricians, the fact that SAS programs are much easier to write than FORTRAN and C programs would not be relevant, and neither would its comparable run-time speed. But as all experienced SAS users know, this is not the case. Often, econometricians who complain about SAS's statistical capabilities are only aware of the basic OLS-based PROC REG and PROC GLM estimation procedures, and are unaware of the capabilities of PROC MODEL, PROC NLIN, PROC SYSLIN, PROC LOGISTIC, PROC MIXED, and the newly offered PROC GLIMMIX. A search through SAS's extensive online documentation, which is another undeniable strength of SAS, would soon convince any experienced econometrician that SAS has procedures and tests for almost all conceivable econometric applications.
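As a small illustration of the data-handling features mentioned above (the dataset and variable names here are hypothetical), the sketch below computes a 12-period moving average with PROC EXPAND and then joins the result to another table with PROC SQL:

   proc expand data=work.monthly out=work.smoothed method=none;
      id date;
      convert sales = sales_ma12 / transformout=(movave 12);   /* 12-period moving average */
   run;

   proc sql;
      create table work.combined as
      select a.date, a.sales_ma12, b.cpi
      from work.smoothed as a, work.prices as b
      where a.date = b.date;
   quit;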


SAS does have a few relative disadvantages and limitations. For pure time-series analysis, packages such as EVIEWS and TSP are somewhat easier to use because they were developed with an intentional 'time-series' orientation. Still, SAS can handle almost all basic time-series tasks, and the more general data capabilities of SAS shine in combined cross-sectional and time-series analysis. Two areas in which I would not recommend SAS, however, are general non-linear optimization and Monte-Carlo simulation. This assessment is not based on finding SAS to be flawed in these areas, but rather on the fact that other packages such as MATLAB are much better at these types of tasks.

The greatest deficit for SAS is its Macro Language Facility, which is no match for the ease with which custom modules and functions can be written in other languages. While SAS has PROC IML, which can be used to create custom procedures in a matrix-based language, MATLAB provides a good example of a programming environment that is much more flexible and intuitive for new users (as long as they are comfortable with matrix algebra concepts). In fact, almost all packages in the third category are significantly more powerful and flexible because most use object-oriented programming concepts. While a full discussion of the disadvantages of traditional programming structures and SAS's somewhat idiosyncratic programming structure is beyond the scope of this paper, it is instructive to note that none of the newer packages emulate the SAS programming structure in any noticeable fashion. Furthermore, many of the packages in the third category have clear commonalities among themselves, and most can 'call' procedures and modules written in another language. While SAS is limited in this regard, for econometricians who find the basic SAS estimation and test procedures to be fully sufficient, this is not an issue. Still, it has long been recognized that translating newly developed formulas and equations in journal articles into working computer code is one of the best ways to understand the underlying theory (and, maybe even more important, to determine whether the justifying conditions are appropriate for a particular application). It cannot be denied that packages such as MATLAB and GAUSS are much stronger in this area.

Finally, it is important to note that SAS and the other packages in the second and third categories are not an either-or proposition. SAS's integration of Open Database Connectivity (ODBC) methods for connecting to database management systems (DBMS) makes it relatively easy to use multiple packages with large datasets, and file-based methods for transferring data are often even easier to implement for small and medium-size datasets. If a programmer finds SAS limiting in a certain step of a project or procedure, and another package has the solution, it is usually not difficult to move data back and forth between SAS and the other package. Thus, an econometrician/programmer can take a pragmatic approach and search for the best tool among the packages that they know and can easily access. In other words, one might use SAS for basic regression estimation and panel-type analysis and another package for non-linear optimization, Monte-Carlo simulation, or simply to compute a single, specially developed test statistic.
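As a sketch of the kind of data movement described here (the file name, DSN, and table names are hypothetical), a SAS dataset can be written out as a CSV file for use in MATLAB or another package, or a relational database table can be read directly through an ODBC libname:

   proc export data=work.results
      outfile="results.csv"
      dbms=csv
      replace;
   run;

   libname db odbc dsn="ResearchDB";     /* assumes an ODBC data source is configured */

   data work.prices;
      set db.daily_prices;               /* read a table from the database            */
   run;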

CONCLUSIONS

The points and issues presented in this paper (comparisons and evaluations, opinions and conclusions, and admittedly a few random thoughts) are far from the last word. Still, a strong case is made for believing that PC technology has advanced to a level that makes computational speed irrelevant for most users of econometric software. Exceptions are difficult nonlinear optimization problems, evaluations of equations that require multi-dimensional integration, extremely large and computationally intensive simulations, and bootstrap exercises that require many millions of calculations. For 90% of all econometric software users, however, over 90% of their applications will fall below the level of calculating complexity where computer speed matters in any real sense. Even econometric models with a million observations and a few hundred equations can be solved or estimated on any reasonable system in a fraction of the time that it would take to port the problem to a 'fast' system.

Still, the pace of progress and innovation in terms of both computer performance and software packages shows no signs of slowing. Thus, it would not be surprising if some aspects of this analysis look extremely dated in a few years. Hopefully, the general framework presented here for evaluating different econometric packages will still be useful. For example, I expect user friendliness and the programmer's personal preferences to remain the overriding factors in a package's selection. Nonetheless, there are cases where forcing a program or statistical package to perform a task for which it is not well suited is not a good choice, simply because other easy-to-use and powerful tools are readily available. In the end, it is much better to look at why SAS is used and how it is best used, as opposed to holding on to prior beliefs and practices, and to be ready to take advantage of new developments in both econometric theory and econometric software.

REFERENCES

MacKie-Mason, Jeffrey K., 1992, 'Econometric Software: A User's View', Journal of Economic Perspectives, American Economic Association, vol. 6(4), pages 165-187, Fall 1992.

Prechelt, Lutz, 2000, 'An Empirical Comparison of Seven Programming Languages', IEEE Computer, October 2000.

Renfro, Charles, 2004, 'A Compendium of Existing Econometric Software Packages', Journal of Economic and Social Measurement, 29 (2004), 359-409.

Rycroft, R. S., 1999, 'Microcomputer Software of Interest to Forecasters in Comparative Review: Updated Again', International Journal of Forecasting, 15 (1999), pp. 93-120.


Steinhaus, Stefan, 2002, 'Comparison of Mathematical Programs for Data Analysis', http://www.scientificweb.de/ncrunch

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. MATLAB is a registered trademark of The Mathworks, Inc. GAUSS is a registered trademark of Aptech Systems, Inc. Stata is a registered trademark of StataCorp LP. Other brand and product names may be trademarks of their respective companies.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Michael D. Boldin
Wharton Research Data Services
The Wharton School, The University of Pennsylvania
216 Vance Hall, 3733 Spruce Street
Philadelphia PA 19104-6301
[email protected]
