EFFICIENT USE OF TABLE LOOKUP PROCEDURES
Craig Ray & Howard Levine
ORI, Inc.
HENDERSON I Table Lookup

NOTE: This paper was written by Craig Ray and Howard Levine, colleagues of Don Henderson at ORI, Inc. ORI personnel often deliver papers at conferences written by other members of the staff.

Abstract

Table lookup procedures (i.e., cross-referencing a lookup file based on the value of a variable in the main file) are generally simple and easily performed in SAS. Both the MERGE statement and PROC FORMAT with the PUT function are commonly used for this purpose. However, if the main file is large, MERGE can be very inefficient, since each lookup requires a sort and a data step, followed by a final sort to restore the original order. Likewise, if the lookup file is very large, using PROC FORMAT becomes prohibitively expensive. Two other table lookup techniques are discussed here. One uses macro variables and is similar to PROC FORMAT in its advantages and disadvantages. The other is a binary search. Other techniques can be found in Reference 1.

Discussion

The effects of file size on the various methods of performing table lookups are presented in the figures accompanying this text. Each approach is described, and its advantages and disadvantages are shown along with a "rule of thumb" for deciding when the technique is appropriate. It may be advisable to refer to the figures before continuing with the text.

Using macro variables may require a lot of core; this technique requires one macro variable for each record in the lookup file. Therefore, as with PROC FORMAT and the PUT function, it is unwise to use this technique for large lookup tables. The strong point of macro variables is that minor changes can easily be made to the lookup table during execution of a SAS program. The disadvantage relative to PROC FORMAT is that several lookup tables may have to be stored in the same macro symbol table, slowing each lookup.
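The macro variable technique described above might be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a lookup data set named LOOKUP with a character key ZIP and a numeric result POP, a main data set named MAIN that also carries ZIP, and a hypothetical name prefix POP for the macro variables.

```sas
/* Step 1: store one macro variable per lookup record.            */
/* The name is a prefix plus the key, e.g. POP00010 (assumed      */
/* naming scheme; ZIP is assumed to be a character variable).     */
data _null_;
  set lookup(keep=zip pop);
  call symput('POP' || trim(zip), put(pop, 12.));
run;

/* Step 2: look up POP for each main file record with SYMGET,     */
/* which resolves the macro variable while the step executes.     */
/* A key with no macro variable yields a missing POP and a note   */
/* in the log.                                                    */
data find;
  set main;
  pop = input(symget('POP' || trim(zip)), 12.);
run;
```

Between the two steps, a single entry can be changed with an ordinary %LET statement (for example, %let POP00010 = 2200000;), which is the kind of minor run-time change the text refers to.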
Caution must be exercised to prevent symbol tables from becoming unwieldy.

A binary search routine will perform well regardless of the size of either file. A binary search successively halves the range of records to be searched in the lookup file until the sought record is found. As the number of records in the lookup file increases, the binary search becomes increasingly efficient relative to all of the other techniques.

There are several factors that should be taken into consideration when performing table lookups. Among them are the programmer time required to prepare the software and the size and composition of the main and lookup files. For small files, it is usually not worth the programmer time required for software design and preparation to use anything except a simple technique such as MERGE. But when processing large files or setting up a data base management system, the size and composition of your files are the primary consideration. One execution of a program with a poorly designed table lookup technique can easily cost more than a day of a programmer's time. By knowing the size and composition of your data bases, you can devise methods that exploit your knowledge of the data and consequently reduce computer costs by optimizing table lookups.

Because every file is different, this paper shows only a simple example of exploiting the composition of your files to achieve maximum efficiency in table lookup. An evaluation should be done on a case-by-case basis. This paper illustrates general techniques whose efficiency is determined by the size of the files more than by any other factor. Even without optimizing for file composition, our guidelines can provide dramatic improvements in efficiency over MERGE and PROC FORMAT when large files are being used.

Some general guidelines for exploiting the composition of files follow.
Strive to minimize the number of conversions, string operations, assignment statements, and reads of observations. For example, by noting that a key is numeric and ranges between 100000 and 109990 with the last digit always 0, a new key can be defined from the third, fourth, and fifth digits of the old key. Since the key is numeric, the new key can perhaps be used to directly access an observation in the lookup table using the POINT option on the SET statement. This gives a basic idea of how file composition can be exploited to improve efficiency.

Cases with large files are, by far, the most interesting to examine because those are the situations where efficiency is most critical. One fact that becomes clear from looking at the figures is that data bases employing large lookup tables (of size 10,000 or more) cannot use formats for table lookups. Macro variables are just as restrictive. Except when a MERGE with a BY statement in a data step is appropriate, the only remaining techniques are binary search and enhanced binary searches.

Binary search is an excellent algorithm by itself, and it is ideal when the main file is small and the lookup file is large. But with large main files as well as large lookup files, an example makes it obvious that enhancements to the binary search can significantly improve efficiency. A binary search routine is enhanced when the range of observations considered is limited. This paper presents one general example of how to do this. Once the program is understood, modifications can be made to achieve maximum efficiency based on the structure of the key to the lookup file.

Consider the following example: a main file has 100,000 records and the lookup file has 32,000 records. A binary search requires an average of INT(LOG2(32000)) = 14 searches per record in the main file.
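A plain binary search of the kind counted in this example might be sketched in a data step as follows. This is an illustration under stated assumptions rather than the authors' program: LOOKUP is assumed to be sorted by ZIP with unique keys, MAIN is assumed not to contain a variable named POP, and the working variable names (lo, hi, mid, found, lkzip, lkpop, nlookup) are hypothetical.

```sas
/* Binary search of LOOKUP (sorted by ZIP) for each record of MAIN. */
data find;
  set main;                      /* read one main file record        */
  lo = 1;
  hi = nlookup;                  /* NOBS= value, known at compile    */
  found = 0;
  do while (lo <= hi and not found);
    mid = int((lo + hi) / 2);    /* midpoint of the remaining range  */
    /* One direct-access read of the lookup file per pass.           */
    set lookup(keep=zip pop rename=(zip=lkzip pop=lkpop))
        point=mid nobs=nlookup;
    if lkzip = zip then found = 1;
    else if lkzip < zip then lo = mid + 1;   /* search upper half    */
    else hi = mid - 1;                       /* search lower half    */
  end;
  if found then pop = lkpop;     /* POP stays missing if no match    */
  drop lo hi found lkzip lkpop;
run;
```

Each pass of the loop executes one SET statement against the lookup file and halves the remaining range, so the number of reads per main file record grows roughly with LOG2 of the lookup file size. The step ends when the sequential SET on MAIN reaches end of file, so no STOP statement is needed.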
At about 14 SET statements per main file record, the lookup table is read about (14 sets per main file record) * (100,000 main file records) = 1.4 million times. If we are able to make a directory of size 1000, then, on average, we can reduce the number of SET statements addressing the lookup table by INT(LOG2(1000)), which is about 10 lookup file SET statements per main file observation. This reduces the number of sets of the lookup file to 400,000, down by 1,000,000. With a savings of this magnitude, the preprocessing step required to establish a directory is clearly worthwhile. Examples of creating directories can be found in Reference 2.

REFERENCES

1. Henderson, Don, 'Table Lookup Techniques', SUGI '82
2. Howard, Neil and Pickle, Linda, 'Efficient Data Retrieval - Direct Access Using the Point Option', SUGI '84

The authors can be contacted at:
ORI, Inc.
122 C St., NW
Suite 250
Washington, D.C. 20001
(201) 737-2666

FIGURES

INTRODUCTION

Table lookup procedures are used for cross-referencing a lookup file based on the value of a variable in a main file.

Define:

main file - the main file being processed, reported on, etc.

lookup file - the file containing desired information for the main file. This file is not necessarily in the same sort order as the main file.

key - the variable (or variables) in common between the main file and the lookup file. Variables in the lookup file are looked up by this group of variables.

result of lookup - the variables ultimately desired from the lookup file.
SAMPLE DATA

MAIN FILE

NAME              ZIP     POLITICAL PARTY
ADAMS, JOHN       29845   R
COLUMBUS, CHRIS   12634   I
COOPER, PAULA     65729   R
DALTON, JAMES     37490   D
DEBBS, EUGENE     83726   S
LORAN, NANCY      13245   D
MARX, KARL        17329   C
PORTER, ALAN      88123   R
SOBER, TOM        14231   D
THORPE, MARTHA    61524   I

LOOKUP FILE

                                  (COWS/CAPITA)   (ASSEMBLYMAN)
ZIP     COUNTY        POP         COWS            ASSEMBLY        PARTY
00010   MANHATTEN     2123654     0               WARNER          D
02845   WESTCHESTER   1534234     .13             DIXON           R
14623   MONROE        712983      1.03            CHARLES         R

KEY - ZIP
RESULT OF LOOKUP - any combination of other LOOKUP FILE variables

MERGE

This technique simply uses the SAS MERGE statement.

ADVANTAGES:
• Easy to code.
• Result of lookup may have any number of variables, each of any length allowed by SAS.

DISADVANTAGES:
• May have to sort the main file, then merge, then re-sort the main file.
• Need to visit every record of the lookup file (but just once).
• Multiple lookups (presumably on different keys and perhaps on different lookup files) need multiple steps.

CONCLUSION: Use if:
1) Main file and lookup file are already sorted by the same keys and the lookup file is not too large with respect to the main file.
or
2) Sort, merge, and sort will not expend too much CPU time (to be determined by the user).

CODE USING MERGE TO LOOK UP POP FOR EACH RECORD OF MAIN

1) proc sort data=main;
     by zip;
   run;

2) /* assumes LOOKUP is already sorted by ZIP */
   data find;
     merge main(in=inmain) lookup(keep=zip pop);
     by zip;
     if inmain;          /* keep only records present in MAIN */
   run;

3) proc sort data=find;  /* restore the original order of MAIN */
     by name;
   run;

NOTE: If multiple lookups are needed (with different lookup files), steps 1 and 2 would have to be repeated.

PROC FORMAT

- PROC FORMAT can be used efficiently as a table lookup procedure.
- Need a tool to convert a SAS data set into a format table.

ADVANTAGES:
* Once the format table is created, lookup is extremely fast.
* Many lookups on different keys can be done in one data step.
* Doesn't require the input data set to be sorted.

DISADVANTAGES:
* The cost of creating the format table relates exponentially to the size of the lookup file.
* The result of lookup is limited to 40 characters.

CONCLUSION: Use if:
1) Lookup file is not too large (less than 10,000 records).
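The PROC FORMAT technique summarized above might be sketched as follows. The paper does not say which tool the authors used to convert a data set into a format table; this sketch assumes the CNTLIN= option available in current SAS releases, and the data set name CTRL and format name $ZIPPOP are hypothetical. LOOKUP is assumed to have unique character keys in ZIP and a numeric POP.

```sas
/* Step 1: build a control data set from LOOKUP. FMTNAME, TYPE,   */
/* START, LABEL and HLO are the variable names CNTLIN= expects.   */
data ctrl;
  set lookup(keep=zip pop) end=last;
  retain fmtname 'ZIPPOP' type 'C';   /* character format $ZIPPOP */
  start = zip;
  label = put(pop, 12.);
  hlo   = ' ';
  output;
  if last then do;                    /* OTHER range: unmatched   */
    start = ' ';                      /* keys map to a blank label */
    label = ' ';
    hlo   = 'O';
    output;
  end;
  keep fmtname type start label hlo;
run;

/* Step 2: create the format table from the control data set.     */
proc format cntlin=ctrl;
run;

/* Step 3: look up POP in one pass over MAIN; no sort is needed.  */
data find;
  set main;
  pop = input(put(zip, $zippop.), ?? 12.);  /* blank gives missing */
run;
```

The OTHER entry is a design choice for this sketch: without it, PUT returns the unmatched key itself, which INPUT would then read as if it were a population value.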