NESUG 2007 Applications Big and Small

SAS®, ECONOMETRICS, AND COMPUTATION SPEED
MICHAEL D. BOLDIN, UNIVERSITY OF PENNSYLVANIA, PHILADELPHIA, PA

ABSTRACT

This paper discusses how researchers can take advantage of SAS and the ever-growing power of personal computers (PCs). Due to previously unimaginable processing speeds and the overall capabilities of relatively inexpensive PCs, many procedures that were once considered computationally prohibitive are now within the reach of almost anyone who wants to use a preferred econometric technique, and SAS is likely to have the desired capabilities. Still, there are cases where forcing any statistical package such as SAS to perform a task for which it is not well suited is not a good choice, simply because other easy-to-use econometric tools are readily available.

INTRODUCTION

Modern personal computers (PCs) have processing and storage capabilities that make the first few vintages of this technology seem toy-like in retrospect. In fact, inexpensive PCs now rival the power that only a few years ago could be found only in much larger and more expensive systems. Therefore, empirical researchers in various fields, including all areas of the social and physical sciences, and specialties such as pharmaceuticals and financial economics, have many options for data management and statistical analysis. This paper discusses the extent to which SAS takes advantage of new technology and remains one of the more user-friendly options for general econometrics.

Because the speed with which data can be manipulated and computations performed has increased over time, many procedures that were once considered computationally prohibitive are now well within the reach of almost anyone who wants to use a particular econometric technique. Furthermore, extremely large databases can be stored in SAS format on a single-user system at a relatively low cost, or just as easily and cheaply, the same data can be made available to many users through either a local network or the internet. In fact, due to less costly and more powerful computing technology, packages such as SAS no longer carry large computational handicaps when compared to more general computing languages (such as FORTRAN and C/C++). In other words, for most econometric applications, the speed of a SAS program is effectively no different from the speed of a custom-written FORTRAN or C program.

At the same time, the technical expertise that was once required to write custom programs in other computer languages has declined considerably. Not only have programming interfaces become more user friendly, many of the alternatives to SAS are fairly standardized (among themselves), and their differences from the SAS programming environment are widening. Therefore, SAS users may feel (rightly or wrongly) that they cannot take advantage of new developments in computer languages and data manipulation tools, especially object-oriented programming concepts and the open-source software movement.

In discussing these issues and the advantages of SAS, I consider how advances in computer technology affect the capabilities and the use of any econometric software package. I also try to place SAS in a perspective that recognizes why it became popular as an alternative to traditional programming and how it compares to other options on a forward-looking basis. This paper is not a review or ranking of statistical packages in the conventional sense, however. The main goal is to present a framework for making choices across types of software packages, such as the decision to use a general programming language for econometrics, a mature statistical package such as SAS, or an alternative such as MATLAB. The MATLAB case is the most interesting because it has gained popularity among econometricians who need to employ non-standard econometric techniques or prefer to use custom-written procedures. While SAS has a module (PROC IML) with similarities to the MATLAB environment, its use seems to be very low even among dedicated SAS users. The general conclusion of this analysis is that user friendliness and personal preferences will remain the overriding factors, such that different choices by researchers with different backgrounds are completely reasonable.
Still, there are cases where forcing a program or statistical package to perform a task for which it is not well suited is not a good choice, and a comparison of the different programming options can help to determine the best choice. In the final analysis, technological progress on both the hardware and software sides allows researchers to use the tools that they find most comfortable; they are not forced into a one-solution-fits-all world. Moving forward, SAS should continue to be used by serious econometricians with diverse interests and needs, and also by more casual empirical researchers in many different fields.


ECONOMETRIC PACKAGE CATEGORIES

It is not possible to evaluate all of the important issues for all econometric packages that could be chosen. Various econometric software lists on the internet have 60 to 100 entries (although these lists tend to double-count versions of the same package and include defunct packages), and there are at least 30 packages that could be considered substitutes for the most used features of SAS. It is not even feasible to compare all aspects of a few popular packages. It is possible, however, to identify three fairly distinct categories of computer languages or software packages for data analysis and applied econometrics:

1. Traditional programming languages (examples: Basic, C/C++, FORTRAN, and Java).
2. Statistical packages (examples: EVIEWS, SAS, Stata, and TSP).
3. Matrix algebra oriented and other mathematical language tools (examples: GAUSS, Mathematica, MATLAB, and Splus).

For the taxonomy above, a distinction can be made between the package and any user-supplied programs. Here, the term 'package' is used to denote any self-contained software application, such as SAS. The package concept includes any module that is typically part of a basic or default installation, or is popular with econometricians, such as SAS's statistics (STAT) and econometrics (ETS) modules. Similarly, a FORTRAN package would include both the compiler and the 'libraries' of functions that are most commonly used with FORTRAN (such as LINPACK and GQOPT). The term 'program' is reserved for user-written applications that may or may not use the package's library of functions. In this context, 'code', 'script', and 'program' are all synonyms for the user-written program files. In some cases, however, user-written code for data manipulation and analysis has evolved into routines and libraries that are either commonly shared or sold as add-on packages (and in some cases in more than one package, such as the IMSL routines), so the distinction between the package and user-written code is not always sharp.

All of the packages listed above make use of high-level language concepts that abstract from the low-level assembly and machine languages that give computers their actual instructions. High-level languages use English words such as IF, THEN, ELSE, DO, FOR, WHILE, OUTPUT and PRINT to represent logical choices, steps, and specific actions, which are then translated into assembly and machine language commands by intermediary programs known as 'compilers', which translate the entire program before executing its instructions, and 'interpreters', which translate and execute programs in stages (often line-by-line). The alternative is to write assembly or machine language commands directly, which would require detailed knowledge of the computer's structure and memory registers and would make the estimation of even simple econometric equations almost impossible for anyone who is not a dedicated computer scientist.

FORTRAN and C/C++ are clearly the premier cases for the first category. FORTRAN was invented in the mid-1950s, and some call it the 'language that refuses to die.' Most important, it is still known for its speed, mainly because computer scientists continue to work on optimizing FORTRAN compilers, and not because the language's original design was so strong. In fact, FORTRAN syntax is often cursed by new users because it is difficult to learn and master.
C/C++ is noteworthy because of its embrace of the object-oriented programming paradigm, and most of the packages in the second and third categories were written in this language. This suggests, but does not necessarily prove, that C/C++ is the most powerful and flexible programming tool available.

SAS is a good representative for the second group because it is the most popular and mature package in this category. SAS can be readily used by almost anyone to manage data, estimate models, and run statistical tests without any formal training in computer programming. The basic structure and conceptual framework of SAS programs has not changed greatly since it was introduced commercially in the 1970s, but the number of systems on which it can be used, its user interface, and its power in managing databases have grown considerably, and for the better. Currently SAS has at least ten serious competitors, with Stata probably having the second largest group of users in this category. Still, SAS's user base dwarfs all of its competitors in both the commercial and academic arenas.

I consider MATLAB to be the most popular and best representative of the third group for econometric analysis, but not all would agree with this assessment. MATLAB was introduced in the 1980s (as was its main commercial competitor, GAUSS) as a way to allow non-programmers to write code that uses concise, matrix algebra based formulas. Irrespective of MATLAB's role as a leader in the group, this category seems to be expanding faster than the other categories, both in the number of serious packages and serious users.


Both SAS and MATLAB have an enthusiastic set of users, active Internet newsgroups, and large archives of user-written programs. On the cost side, SAS and MATLAB are the most expensive in their categories (especially for commercial users, with a much lower licensing cost differential for academic/educational use). Admittedly, these two packages are not as popular among time series econometricians as other packages (such as RATS, TSP and EVIEWS). Still, SAS and MATLAB are much stronger for time series econometrics than the other packages are at managing data and estimating either pure cross-section or panel-type models. Both can be used to write code for newly developed time-series procedures, and both continue to improve their offerings to time-series econometricians.

It is also interesting to note that the companies that provide SAS and MATLAB are similar in many respects. Both started as 'academic' projects but have since gone commercial and private, and each reports exponential growth in employees over the past 10 years. Such employee growth is probably the best metric of their success and standing among their competition. With large employee bases, both provide high levels of support, and after their early inceptions, neither has favored any particular operating system (such as MS-Windows or UNIX) over another. These two packages seem most likely to remain innovative and popular, even though they use completely different programming paradigms.

A fourth category for 'scripting' languages such as Perl and Python that make use of specially developed statistical packages could also be considered, but this grouping would overlap with the first and third categories in many respects. For example, both Perl and Python can 'call' C and FORTRAN subroutines, as can many of the packages in the third category. Most noteworthy in this regard is the fact that the NumPy module in Python can be used to write code that resembles the manner in which arrays are used in GAUSS and MATLAB. Also, both Python and Perl have modules that effectively embed R as an intrinsic library. Thus, languages such as Perl and Python can be treated as examples of the third category.

Another way to look at the three categories is that they represent packages that became popular and 'mature' almost sequentially. There are a few exceptions to this view, such as Java, and more important, it is not fair to treat the older packages as inferior simply because they are not modern. A fair evaluation of the packages in the three groups should consider: (1) how easy it is to write programs in their high-level language, which is basically a normative evaluation of the intuitive nature and logic of the language's syntax rules; (2) debugging ease, including the difficulty of writing valid program code and the potential for making and not finding or correcting mistakes; (3) the performance of programs in terms of speed and computer memory conservation; (4) interfaces to other programming tools; and (5) database management issues.

REVIEW OF THE REVIEWS

There have been countless reviews and comparisons of statistical and econometric software packages. Most of these articles are not comprehensive in covering the plethora of choices, however, and there are other problems with the reviews as well. Rycroft (1999) provides a fairly extensive list, showing 51 packages from 32 companies, along with a comparison of features, but the latter is extremely dated. Renfro (2004) counts over 30 standalone packages in an effort to document the state of the art of econometric software as of 2003, but inexplicably does not include SAS or MATLAB (even though Stata and GAUSS are included). The majority of the other reviews cover only one or two packages, and some are not much deeper than a list of the features that are likely to be of interest to an econometrician. Although the best reviews explain the main strengths and weaknesses of each reviewed package, almost none conclude that one package is clearly superior to another.

Two noteworthy reviews that are somewhat different are a review of econometric software for the American Economic Association's Journal of Economic Perspectives by Jeffrey MacKie-Mason (1992) and a series of web-published reports by Stefan Steinhaus (dated 1997, 1999, and 2002). MacKie-Mason attempts to help 'practising economists decide which [econometric software] tool will best get the task done.' In the end, however, MacKie-Mason could not select a single package as the unqualified winner or most useful tool for all users among GAUSS, Limdep, RATS, SAS, SST, Stata, and TSP. He did conclude that TSP was the best overall package, save for data handling, but qualified this selection by stating that SAS had the most comprehensive set of statistical procedures and was unchallenged in its data importing and database creation tools. SAS was deemed to have one major shortcoming: 'it can be overwhelming for occasional users'. He also stated that GAUSS was a good choice for those who enjoy programming, and that Stata was probably the easiest to learn of the seven packages.

Although MacKie-Mason's review is thoughtful and contains good advice, it is dated in terms of the versions of the packages he examined, simply because of the fast pace of change and improvement among the packages. Most of the concerns or problems that he noted about the available statistical procedures and tests have long been corrected, as have deficiencies in documentation and help facilities.


In addition, the data handling methods of each package have been completely revamped in almost all cases. Finally, it is no stretch to believe that simply adding the current user interfaces to any of the packages that he studied would have been enough to tip the scales far in favor of that package. Nonetheless, MacKie-Mason's final advice and conclusions are still consistent with general impressions about these packages. For example, the matrix-oriented features of GAUSS were seen as powerful and innovative in the early 1990s, and although GAUSS is not as unique and innovative by today's standards, this paradigm has proven itself to be extremely useful to econometricians.

In the other noteworthy comparison, Steinhaus scored packages that are primarily in the third category in terms of their functionality, using the following weights: mathematical functions 38%, graphical functions 10%, programming environment 9%, data import/export 5%, available operating systems 2%, and speed comparison 36%. In the 2002 ranking, MATLAB received the overall highest score, 69 out of 100 points, followed by GAUSS, Ox, and Mathematica. (SAS was not considered.) In the 2002 speed tests, the highest ratings went to O-Matrix, Ox, MATLAB, and GAUSS, in that order, but the speed scores are close enough to be considered operationally equivalent. Problems with the speed comparisons are discussed below, and even in terms of overall scores, it is hard to consider the evaluations as completely objective or applicable to all potential users of the packages.

A DIGRESSION ON COMPUTER SPEED

Even casual users of computers are well aware of the extremely fast computing speeds and other advances that are now available on low-priced personal computers. For example, the original IBM PC model sold in 1981 at today's price of about $5,500, but had no hard disk, a low-resolution monochrome monitor, no networking capabilities, and what many now consider a flawed chip design and operating system. Today, one can buy a PC for under $1,500 that has 3,000 times faster computing speeds, a 100 GB hard drive, a CD reader and burner, a vibrant color monitor, and built-in networking options. One would certainly expect such advances to affect how computers are used for econometric purposes. Figure 1 puts the computing speed advances of the past 12 years into perspective. The table records Whetstone scores, an industry benchmark of a chip's ability to perform floating point calculations, for what were state-of-the-art PC chips at different points in time.

Fig. 1.0 PC CHIP SPEEDS AND PERFORMANCE

   CPU                  Clockspeed   Year   MWIPS   Time index
   Intel Pentium I      120 MHz      1995      79       100.0
   Intel Pentium II     266 MHz      1997     218        36.2
   Intel Pentium III    550 MHz      1999     448        17.6
   Intel Pentium IV     1.8 GHz      2001     638        12.4
   Intel Pentium IV     3.6 GHz      2003    1342         5.9
   AMD Opteron          2.0 GHz      2003    1580         5.0
   AMD Athlon 64        2.1 GHz      2003    1720         4.6
   Intel Core 2 Duo *   2.4 GHz      2006    2057         3.8

MWIPS = Millions of Whetstone Instructions per Second. A higher MWIPS score is better (i.e., a faster chip), and a twice-as-high MWIPS translates into roughly 50% less time to make an average numerical calculation. Source: http://homepage.virgin.net/roy.longbottom/whetstone.htm
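The Time index column is simply the MWIPS column rescaled relative to the 1995 chip: Time index = 100 x (79 / MWIPS). For example, the Core 2 Duo's 2057 MWIPS corresponds to 100 x 79/2057, or roughly 3.8, meaning a fixed set of calculations takes about 3.8% of the time it took on the 1995 Pentium I.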

These scores show that new chips can compute mathematical functions 8 to 18 times faster than chips that were popular in the late 1990s. And it is likely that computation speeds will continue to double at least every 18 to 24 months (based on an implication of Moore's law, which predicts that the number of transistors per square inch on integrated circuits doubles every year), with dual- and multi-core processors likely to accelerate the pace even further.


While the MWIPS column in Figure 1 helps to gauge the increase in processor speed, the Time index is a better way to look at the issue. This measure inverts the number of calculations per second to show the time required for a fixed number of calculations. While there may be no limit to a statistic such as MWIPS, there is a limit to the time improvement for a particular set of calculations. In essence, it is more relevant to know the time needed for a fixed number of statistical calculations, which corresponds to the Time index above, than to know the number of calculations that can be computed in a fixed period of time. Most important, the time for a fixed set of calculations cannot break through the 'zero' point (it is bounded by zero).

While one might be impressed to hear that system A is four times faster than system B, one should consider absolute speed and the diminishing returns to relative speed. The ability to lower an 8-minute calculation by a factor of four to 2 minutes is much more valuable than lowering the calculation time by another factor of four to 30 seconds. Lowering a calculation from 1.0 second to 0.25 seconds is the more realistic case today, and it only yields an absolute speed increase of 0.75 seconds, which is not likely to be noticeable or worthwhile to most users if doing so is time consuming and has other set-up costs. In essence, relative speed increases can be very misleading, and the speed issue needs to be part of a more complete and broader perspective for evaluating econometric software.

To put the computer technology and speed issue in another perspective, consider the series of speed tests for mathematical/statistical packages reported by Stefan Steinhaus (discussed above). In any of these evaluations, running one of the four packages GAUSS, MATLAB, O-Matrix, or Ox on a PC that was just two times faster than the PC used for the other packages would put that package at the top, both in the speed tests and in the overall score. While this change in the benchmark PC might seem unfair, it is often a choice that a user of a particular econometric package can make: instead of investing the time required to learn a new and supposedly faster package, it may make more sense to invest in a new and clearly faster PC.

To put specific numbers behind this point, in a 1999 speed test Steinhaus used MATLAB (version 5.3) running on a Pentium-II 400 MHz Windows PC to invert a 1000x1000 matrix in 44.0 seconds. For the 2002 test, the same-size matrix was inverted in 7.9 seconds (using MATLAB version 6.3) on a Pentium-III 550 MHz Windows PC. I repeated the same exercise in 4.6 seconds (using MATLAB version 7.0) on a Pentium M 1.7 GHz Windows laptop, and in 2.0 seconds on a Pentium IV 3.0 GHz Windows PC. Comparing the 1999 and 2002 tests shows a speed increase of 600%, and comparing the 2002 tests to my tests shows a speed increase of between 100% and 300%. All of these increases in speed greatly exceed the difference between the fastest and slowest packages in Steinhaus's tests in any year. The main conclusion is that the progress of PC technology can make almost any package run much faster on a new PC than on an older and previously 'fast' PC. Thus, speed benchmarks, as they are normally conducted on a single PC, can be misleading.
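For readers who want to repeat this simple matrix-inversion benchmark on their own hardware, a minimal sketch using SAS's PROC IML might look like the following. The matrix size and seed are arbitrary choices, and this is only one way to set up the timing:

   proc iml;
      n = 1000;                      /* matrix dimension                            */
      x = normal(j(n, n, 1));        /* n x n matrix of pseudorandom normal draws   */
      t0 = time();                   /* time of day, in seconds                     */
      xinv = inv(x);                 /* invert the matrix                           */
      elapsed = time() - t0;         /* elapsed wall-clock time for the inversion   */
      print elapsed;
   quit;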

AN ALTERNATIVE FRAMEWORK FOR EVALUATION

This section makes use of a stylized model that represents the elements that are most relevant in deciding the best package to use for a particular econometric task, and it puts the speed issue in better perspective. Although the 'model' is extremely abstract, it is useful because it considers in concrete terms how 'computation speed' may or may not matter.

Assume an econometric software user desires to minimize the combined computation time and programming effort, where the driving factor for each element is a complexity metric, x. This metric can be thought of as the number of computations needed to complete the task, such that the higher the complexity number, the longer the computation time and, in most cases, the greater the programming effort. Furthermore, assume that each package has its own functional parameters for how it handles complexity (i.e., translates complexity into computation time and programming effort) and that these functions are quadratic:

   Computation Time   = (a0 + a1*x + a2*x^2) / speed
   Programming Effort = (b0 + b1*x + b2*x^2) * cost

The parameters a0 and b0 represent set-up time, while a1 and b1 are the linear coefficients and a2 and b2 are the quadratic coefficients. All coefficients are assumed to be non-negative, with the linear coefficients (a1 and b1) generally the most important terms, although high quadratic terms (a2 and b2) are also realistic for certain cases. The 'speed' and 'cost' terms are common parameters that affect the functions for all packages in the same manner, where the latter translates programming effort into a time-comparable cost.
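To make the framework concrete, the following SAS DATA step sketch evaluates the two cost functions over a complexity grid and flags the lower-cost package at each level. The coefficient values here are purely hypothetical choices for illustration, not the parameterization behind the charts below:

   data costs;
      speed = 1;  cost = 1;                            /* common scaling parameters     */
      do x = 0 to 10;                                  /* x = task complexity           */
         time1   = (0.5 + 0.5*x + 0.05*x**2) / speed;  /* Package 1: slower computation */
         time2   = (0.1 + 0.1*x)             / speed;  /* Package 2: faster computation */
         effort1 = (1.0 + 2.0*x) * cost;               /* Package 1: easier to program  */
         effort2 = (4.0 + 1.8*x) * cost;               /* Package 2: harder to program  */
         total1  = time1 + effort1;
         total2  = time2 + effort2;
         better  = ifc(total1 <= total2, 'Package 1', 'Package 2');
         output;
      end;
   run;

   proc print data=costs;
      var x total1 total2 better;
   run;

With these particular values the total-cost advantage switches from Package 1 to Package 2 at a complexity level between 3 and 4, which is the kind of crossover the charts below are meant to illustrate.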


Below is a graphical representation of these functions (for particular parameterizations), where Package 2 is assumed to be much faster than Package 1, and only Package 1 is assumed to have a non-zero quadratic coefficient (although a relatively small one in this case). The first panel in the chart shows a case where Package 2 would be selected irrespective of the simplicity or complexity of the task if computation time were all that mattered. (In this case Package 2 is assumed to have a greater than 4:1 advantage in speed, with time to compute being the inverse of speed.) The second panel of Chart 1.1 provides a User Programming Element, or effort, comparison in which it is assumed that Package 1 has an advantage for most tasks, but very complex cases are better handled with Package 2.

Chart 1.1 Comparison of Computation and Programming Time

[Three panels plotted against task complexity (0 to 10) for Package 1 (slow and easy) and Package 2 (fast and hard): the Computation Time Element, the User Programming (Time and Effort) Element, and the Differences in Costs (Total, Computation, and Programming).]

Admittedly, the parameterization for this example has arbitrary elements, but the general relationship between programming effort and computational time for the average task (complexity = 5), at roughly 2:1 for Package 1, is a reasonable baseline. The final panel combines both elements, showing that Package 2 is the better choice for any task with a complexity level above 4.

Chart 1.2 shows the same basic structure, with only two changes: an increase in speed by a factor of 10 for both packages (which is meant to represent the speed advantage of a new computer generation), and a decrease in the common programming effort cost by one-half. The net effect is to widen the complexity range over which Package 1 has the advantage from 0-4 to 0-6.


Chart 1.2 Comparison of Computation and Programming Time with Improvement in Relative Processing Speed

[Same three panels as Chart 1.1, re-plotted with a tenfold increase in processing speed for both packages and a one-half reduction in the common programming effort cost.]

Other parameterizations can be used, and many will be just as reasonable, but as long as computation speed increases faster than programming costs fall over time, a comparison of 'total costs' for a task will become largely driven by the programming effort element. Given that the slower but more user-friendly packages were popular at computing speeds and capabilities far below today's standards, it is reasonable to see the world of econometric package selection as more like Chart 1.2 than Chart 1.1. Furthermore, the range in which the easier but slower package is preferred will tend to increase with each new computer generation.

In sum, the analysis in this section can be seen as a confirmation of basic observations about the market for econometric software. There is no denying that easier-to-use and slower-than-average packages are popular, and they seem to be growing in popularity as their software developers add features and improvements, which in turn are supported by users who purchase the latest updates. This is not necessarily an obvious development, as another path could have dominated based on the fact that some users surely migrate to the faster and more flexible languages as they gain programming experience. These users probably care little about upgrades to the first econometric package that they learned to use, but it seems that the average user does care.

MORE COMPUTING SPEED TESTS

Even with reservations about the relevance of computing speed, I present my own tests, mainly to compare the speed of SAS to MATLAB and to compare both packages to FORTRAN and C/C++. In these comparisons, two distinct speed tests were devised using different packages on two different computing platforms. The first test was a Black-Scholes call option valuation exercise, where 1 million values were computed on each platform/package combination. While the need to compute such a large number of option values is unrealistic for most applications, this large number of calculations helps to accentuate the speed differences.


The second test involved combining data tables of stock returns, sorting the results, and computing a CAPM beta for each company's stock. The first set of runs used a Sun V440 system running Solaris 9.0 (UNIX) with four 1 GHz processors and 8 GB of memory. This is the type of server or minicomputer system that might be shared by 5 to 10 users who are interested in a fairly high level of performance. The second set used a basic MS-Windows PC with a 1.7 GHz Pentium 4 processor that serves as a mid-range single-user system. The results are shown in Figure 2.1; all of the programs were standardized to be as comparable as possible, and default settings were used for compiling and processing.
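For reference, a minimal sketch of the kind of SAS DATA step used for the first test follows. The input values and the random strike grid are hypothetical simplifications of the actual test program; the point is only to show that the Black-Scholes formula reduces to a few lines of built-in SAS functions:

   data calls;
      r     = 0.05;                 /* risk-free rate (hypothetical)           */
      sigma = 0.20;                 /* volatility (hypothetical)               */
      t     = 0.5;                  /* time to expiration, in years            */
      s     = 100;                  /* spot price                              */
      do i = 1 to 1000000;          /* 1 million valuations                    */
         k  = 80 + 40*ranuni(12345);                     /* random strike, 80-120 */
         d1 = (log(s/k) + (r + sigma**2/2)*t) / (sigma*sqrt(t));
         d2 = d1 - sigma*sqrt(t);
         call_value = s*probnorm(d1) - k*exp(-r*t)*probnorm(d2);
         output;
      end;
      keep k call_value;
   run;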

Fig. 2.1 BLACK-SCHOLES OPTION VALUATION FORMULA CALCULATION SPEEDS
Time in seconds, 1 million computations.

                      System A     System B
                      Sun V440     Pentium 4 PC
   C Program              3.0          1.5
   Fortran                4.1          --
   Matlab                 2.4          1.4
   SAS                    4.6          6.7
   R                      --           1.9
   EXCEL-VBasic           --         560.0

The results show that a Pentium 4 PC is not just competitive with an average multi-processor system; it can perform better in some tests. The relationships among the packages across the two systems were not fixed, however. Most important, the fact that the times for the C program, MATLAB, and R were within tenths of a second of each other shows that absolute speed is only relevant for extremely computationally intensive tasks. Times for FORTRAN were higher than for these three packages on the UNIX server system, which may be due to the compiling settings. At the least, the FORTRAN result shows that it may take some experience to get the most out of this language. The EXCEL-Visual Basic results are far and away the worst, and this example was mainly included to show that not all software selections are appropriate when speed matters. While SAS was second slowest, for the more common uses of this package, which require far fewer than a million calculations, the time performance of SAS is likely to be adequate.

The second computational test involved computing regression statistics (CAPM betas) for 500 stocks in the CRSP database, with the set of stocks selected as those that were members of the S&P 500 index at the end of December 2004. Daily data were used, and each stock was treated separately for purposes of the regression, but the data were extracted from the database and prepared as a group for the 500 regressions. The date range for each stock was limited to 10 years (1995-2004), with an additional constraint that no date range for a stock started before the company was a member of the S&P 500. This variable date range feature was intended to add a complication to the data extraction and preparation steps, and also to the regression computation, which makes this speed test more realistic than simply using the exact same date range for each stock.
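A minimal sketch of the SAS portion of this test follows; the dataset and variable names are hypothetical, and the data extraction and date-range screens are omitted. After the prepared returns data are sorted by stock identifier, a single BY-group PROC REG call estimates all 500 regressions and writes the coefficients to an output dataset:

   proc sort data=work.returns;
      by permno date;                  /* one record per stock per trading day */
   run;

   proc reg data=work.returns outest=work.betas noprint;
      by permno;                       /* one regression per stock             */
      model ret = mktret;              /* slope on mktret is the CAPM beta     */
   run;
   quit;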


The results are reported in Figure 2.2, which shows that the FORTRAN program is clearly the fastest in an initial run.

Fig. 2.2 CAPM REGRESSION CALCULATIONS
Time in seconds.

                  Run 1     Run 3     Regression portion
   C Program       19.2       3.7          < 0.5
   Fortran          3.4       3.1          < 0.5
   Matlab          27.0      23.1          < 0.5
   SAS            235.0      41.0            1.0

System: Sun V440. Times in seconds are based on custom-written programs that compute 500 regression equations with one explanatory variable and approximately 2,000 observations per regression (961,000 total observations). Total test times include data access and preparation. Run 3 times allow for potential memory caching. The '< 0.5' second designations could not be measured with more precision.

This conclusion can be somewhat misleading, however, given that the FORTRAN program needed to be compiled as a first step, which adds about 10 seconds (but is not needed for subsequent runs). The C program takes considerably longer to both compile and make an initial run, but surprisingly the same program produces basically the same time as the FORTRAN run in follow-up runs, because the data was already in the computer's memory cache. In any test of this type, it is clear that data access time overwhelms computational time, and disk reading speed (which caching in the subsequent runs often avoids) can vary by even greater margins than the results shown above. Even the slowest run time from a SAS program is clearly acceptable given the necessary programming time, which would range from 10 to 20 minutes and could easily exceed 1 hour for inexperienced programmers, even under the best circumstances. Most important, the last column in Figure 2.2 shows that the econometric calculations (the regression portion) are effectively equivalent across the four packages.

While the computation speed tests above are not fully representative of the 'average' econometric application, they are realistic examples showing that computation speed is rarely relevant. In fact, it is not easy to find realistic or common cases that show noticeable differences. For example, when computing OLS coefficients for cases with fewer than 1 million observations and 100 explanatory variables, the time differences are usually even smaller than those reported above. And due to multi-core processors that have become de rigueur on new PCs, any time differences will matter even less in the future.

MORE ON PROGRAMMING CONSIDERATIONS

Programming time and effort varied greatly for the above tests, especially the second set of tests. Evaluating the effort or experience needed to write a particular program is admittedly subjective, but counting the lines of code needed to accomplish a particular task serves as a useful proxy for ranking the user friendliness of each program or package. In fact, computer scientists often use program line counts as an approximate measure of both programming time cost and program complexity, with a general assumption that programming time and program line counts are roughly proportional. Below are the line counts for the programs used in the second test (CAPM regressions) above:

   FORTRAN   103
   C          64
   SAS        15
   Matlab     20

These numbers are one way to quantify the claim that C and FORTRAN programs can be many times harder to write than programs for packages in the second and third groups, such as SAS and MATLAB. While this is not the only way to show this, the assertion is fairly uncontroversial. More important, the differences in programming time greatly dwarf the differences in run times, by a factor of at least 100:1.


Thus, switching from SAS to FORTRAN or C is rarely a good decision. Switching from SAS to MATLAB (or another package in the second or third categories, and vice versa) might be more reasonable, but based on an assessment that considers the additional time and costs required to learn and use a new package, there are few compelling reasons for an experienced user of one package to switch.

It is also fairly easy to show examples where a poorly designed program for a particular task will run more than 10 times slower than necessary, irrespective of the raw speed and capabilities of the programming language or statistical package. For example, the computation time of the SAS and MATLAB programs used in the second test above would have increased substantially if 'loops' that emulated the FORTRAN and C program steps had been used to process the data and compute the regression statistics, instead of the most efficient steps for these two packages. Such loops probably would seem natural to a FORTRAN or C programmer who wanted to try and test a different package, but any experienced SAS or MATLAB user would know a better method. Similarly, the FORTRAN and C programs used in the second test were based on well-established and tested data access and computation algorithms, and other program designs could easily double or triple the computing time.

For a discussion of this last issue, see Prechelt (2000), which presents results from using students to build distinct implementations of the same programming problem in seven different languages (C, C++, Java, Perl, Python, Rexx and Tcl). Prechelt then analyzed program length, programming effort, runtime efficiency, memory consumption, and reliability. One of his conclusions, shared by many other computer scientists and experts in the field, is that designing and writing a program in an interpreted, scripting language (such as Perl, Python, Rexx, or Tcl) can take less than half as much time as writing it in a compiled language (such as C, C++, or Java), simply because the programs tend to be half as long or shorter. Still, Prechelt finds no clear differences in program reliability among the language groups, although script programs tend to consume about twice as much memory (as a C or C++ program) and run slower on average. His most important finding might be: "For all program aspects investigated, the performance variability that derives from differences among programmers of the same language ... is on average as large or larger than the variability found among the different languages."

In the final assessment of the programming issues, 'implementation' often matters more than 'potential' for computational speed. Program reliability and accuracy matter as well, which strongly suggests that one should not stray from the package that one knows best without extremely good reasons.
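Returning to the point above about loop-style designs, the following sketch (with hypothetical dataset and variable names) shows the slower approach: one PROC REG call per stock through a macro %DO loop. Each iteration re-reads and subsets the full dataset, so this design is far slower than the single BY-group run shown earlier, even though it produces the same estimates:

   %macro capm_by_loop(nstocks);
      %do i = 1 %to &nstocks;
         proc reg data=work.returns(where=(stockid = &i)) outest=work.beta&i noprint;
            model ret = mktret;
         run;
         quit;
      %end;
   %mend capm_by_loop;

   %capm_by_loop(500)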

THE CAPABILITIES OF SAS

Among the programming advantages of SAS is its ability to handle both small and large data sets in an equally adept manner, with the base package offering numerous data import and export options. For basic data transformations, SAS DATA steps are relatively easy to write and follow. In addition, PROC EXPAND can handle complicated data transformations, such as weighted moving averages and other data filters (such as Hodrick-Prescott), in a single line of code. SAS also has fast sorting and record-selection capabilities, plus a built-in ability to handle SQL syntax to 'join' tables or datasets.

If SAS had few of the tools and procedures that are needed by most econometricians, the fact that SAS programs are much easier to write than FORTRAN and C programs would not be relevant, and neither would its comparable run-time speed. But as all experienced SAS users know, this is not the case. Often, econometricians who complain about SAS's statistical capabilities are only aware of the basic OLS-based PROC REG and PROC GLM estimation procedures, and are unaware of the capabilities of PROC MODEL, PROC NLIN, PROC SYSLIN, PROC LOGISTIC, PROC MIXED, and the newly offered PROC GLIMMIX. A search through SAS's extensive online documentation, which is another undeniable strength of SAS, would soon convince any experienced econometrician that SAS has procedures and tests for almost all conceivable econometric applications.
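As a small illustration of the data-handling features mentioned above (the dataset and variable names here are hypothetical), the sketch below computes a 12-period moving average with PROC EXPAND and then joins the result to another table with PROC SQL:

   proc expand data=work.monthly out=work.smoothed method=none;
      id date;
      convert sales = sales_ma12 / transformout=(movave 12);   /* 12-period moving average */
   run;

   proc sql;
      create table work.combined as
      select a.date, a.sales_ma12, b.cpi
      from work.smoothed as a, work.prices as b
      where a.date = b.date;
   quit;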


SAS does have a few relative disadvantages and limitations. For pure time-series analysis, packages such as EVIEWS and TSP are somewhat easier to use because they were developed with an intentional 'time-series' orientation. Still, SAS can handle almost all basic time-series tasks, and the more general data capabilities of SAS shine in combined cross-sectional and time-series analysis. Two areas in which I would not recommend SAS, however, are general non-linear optimization and Monte-Carlo simulation. This assessment is not based on finding SAS to be flawed in these areas, but rather on the fact that other packages such as MATLAB are much better at these types of tasks.

The greatest deficit for SAS is its Macro Language Facility, which is no match for the ease with which custom modules and functions can be written in other languages. While SAS has PROC IML, which can be used to create custom procedures in a matrix-based language, MATLAB provides a good example of a programming environment that is much more flexible and intuitive for new users (as long as they are comfortable with matrix algebra concepts). In fact, almost all packages in the third category are significantly more powerful and flexible because most use object-oriented programming concepts. While a full discussion of the disadvantages of traditional programming structures and SAS's somewhat idiosyncratic programming structure is beyond the scope of this paper, it is instructive to note that none of the newer packages emulate the SAS programming structure in any noticeable fashion. Furthermore, many of the packages in the third category have clear commonalities among themselves, and most can 'call' procedures and modules written in another language. While SAS is limited in this regard, for econometricians who find the basic SAS estimation and test procedures to be fully sufficient, this is not an issue. Still, it has long been recognized that translating newly developed formulas and equations in journal articles into working computer code is one of the best ways to understand the underlying theory (and, maybe even more important, to determine whether the justifying conditions are appropriate for a particular application). It cannot be denied that packages such as MATLAB and GAUSS are much stronger in this area.

Finally, it is important to note that SAS and the other packages in the second and third categories are not an either-or proposition. SAS's integration of Open Database Connectivity (ODBC) methods for connecting to database management systems (DBMS) makes it relatively easy to use multiple packages with large datasets, and file-based methods for transferring data are often even easier to implement for small and medium-size datasets. If a programmer finds SAS limiting in a certain step of a project or procedure, and another package has the solution, it is usually not difficult to move data back and forth between SAS and the other package. Thus, an econometrician/programmer can take a pragmatic approach and search for the best tool among the packages that they know and can easily access. In other words, one might use SAS for basic regression estimation and panel-type analysis and another package for non-linear optimization, Monte-Carlo simulation, or simply to compute a single, specially developed test statistic.
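As a sketch of the kind of data movement described here (the file name, DSN, and table names are hypothetical), a SAS dataset can be written out as a CSV file for use in MATLAB or another package, or a relational database table can be read directly through an ODBC libname:

   proc export data=work.results
      outfile="results.csv"
      dbms=csv
      replace;
   run;

   libname db odbc dsn="ResearchDB";     /* assumes an ODBC data source is configured */

   data work.prices;
      set db.daily_prices;               /* read a table from the database            */
   run;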

CONCLUSIONS

The points and issues presented in this paper (comparisons and evaluations, opinions and conclusions, and admittedly a few random thoughts) are far from the last word. Still, a strong case is made for believing that PC technology has advanced to a level that makes computational speed irrelevant for most users of econometric software. Exceptions are difficult nonlinear optimization problems, evaluations of equations that require multi-dimensional integration, extremely large and computationally intensive simulations, and bootstrap exercises that require many millions of calculations. For 90% of all econometric software users, however, over 90% of their applications will fall below the level of calculating complexity where computer speed matters in any real sense. Even econometric models with a million observations and a few hundred equations can be solved or estimated on any reasonable system in a fraction of the time that it would take to port the problem to a 'fast' system.

Still, the pace of progress and innovation in terms of both computer performance and software packages shows no signs of slowing. Thus, it would not be surprising if some aspects of this analysis look extremely dated in a few years. Hopefully, the general framework presented here for evaluating different econometric packages will still be useful. For example, I expect user friendliness and the programmer's personal preferences to remain the overriding factors in a package's selection. Nonetheless, there are cases where forcing a program or statistical package to perform a task for which it is not well suited is not a good choice, simply because other easy-to-use and powerful tools are readily available. In the end, it is much better to look at why SAS is used and how it is best used, as opposed to holding on to prior beliefs and practices, and to be ready to take advantage of new developments in both econometric theory and econometric software.

REFERENCES

MacKie-Mason, Jeffrey K., 1992, 'Econometric Software: A User's View', Journal of Economic Perspectives, American Economic Association, vol. 6(4), pages 165-187, Fall 1992.

Prechelt, Lutz, 2000, 'An Empirical Comparison of Seven Programming Languages', IEEE Computer, October 2000.

Renfro, Charles, 2004, 'A Compendium of Existing Econometric Software Packages', Journal of Economic and Social Measurement, 29 (2004), 359-409.

Rycroft, R. S., 1999, 'Microcomputer Software of Interest to Forecasters in Comparative Review: Updated Again', International Journal of Forecasting, 15 (1999), pp. 93-120.


Steinhaus, Stefan, 2002, 'Comparison of Mathematical Programs for Data Analysis', http://www.scientificweb.de/ncrunch

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. MATLAB is a registered trademark of The Mathworks, Inc. GAUSS is a registered trademark of Aptech Systems, Inc. Stata is a registered trademark of StataCorp LP. Other brand and product names may be trademarks of their respective companies.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Michael D. Boldin
Wharton Research Data Services
The Wharton School, The University of Pennsylvania
216 Vance Hall, 3733 Spruce Street
Philadelphia PA 19104-6301
[email protected]
