HENDERSON I Table Lookup

EFFICIENT USE OF TABLE LOOKUP PROCEDURES

Craig Ray & Howard Levine ORI, Inc.

NOTE: This paper was written by Craig Ray and Howard Levine, colleagues of Don Henderson at ORI, Inc. ORI personnel often deliver papers at conferences written by other members of the staff.

Abstract

Table lookup procedures (i.e., cross referencing a lookup file based on the value of a variable in the main file) are generally rather simple and easily performed using SAS. Both the MERGE statement and PROC FORMAT with the PUT function are commonly utilized for this purpose. However, if the main file is large, MERGE can be very inefficient since it requires sorting and a data step for each lookup and finally another sort. Likewise, if the lookup file is very large, using PROC FORMAT becomes prohibitively expensive. Two other table lookup techniques are discussed here. One uses MACRO variables and is similar to PROC FORMAT in its advantages and disadvantages. The other is a binary search. Other techniques can be found in Reference 1.

Discussion

The effects of file size on the various methods of performing table lookups are presented in the figures accompanying this text. Each approach is described and its advantages and disadvantages are shown along with a "rule of thumb" to tell when the technique is appropriate. It may be advisable to refer to the figures before continuing with the text.

Using MACRO variables may require a lot of core; this technique requires one macro variable for each record in the lookup file. Therefore, as with PROC FORMAT with the PUT function, it is not wise to use this technique for large lookup tables. The strong point of using macro variables is that minor changes can easily be made to the lookup table during execution of a SAS program. The disadvantage with respect to using PROC FORMAT is that several lookup tables may have to be stored in the same macro symbol table, slowing each lookup. Caution must be exercised to prevent symbol tables from becoming unwieldy.

A binary search routine will perform well regardless of the size of either file. A binary search successively divides the range of records to be searched in the lookup file in half until the sought record is found. As the number of records in the lookup file increases, the binary search becomes increasingly efficient relative to all of the other techniques.

There are several factors that should be taken into consideration when performing table lookups. Among them are the programmer time required to prepare the software and the size and composition of the main and lookup files. For small files, it is usually not worth the programmer time required for software design and preparation to use anything except a simple technique such as MERGE. But when processing large files or setting up a data base management system, the size and composition of your files are the primary considerations. One execution of a program with a poorly designed table lookup technique can easily cost more than a programmer for a day. By knowing the size and composition of your data bases, you can devise methods to exploit your knowledge of the data and consequently reduce computer costs by optimizing table lookups.

Because every file is different, this paper shows only a simple example of exploiting the composition of your files to achieve maximum efficiency in table lookup. An evaluation should be done on a case-by-case basis. This paper primarily illustrates general techniques whose efficiency is determined by the size of the files more than by any other factor. Even without optimizing for file composition, our guidelines can provide dramatic improvements in efficiency over MERGE and PROC FORMAT when large files are being used.

Some general guidelines for exploiting the composition of files follow. Strive to minimize the number of conversions, string operations, assignment statements, and reads of observations. For example, by noting that a key is numeric and ranges between 100000 and 109990 with the last digit 0 at all times, a new key can be defined based on the third, fourth, and fifth digits of the old key. Since the key is numeric, perhaps the new key can be used to directly access an observation on the lookup table using the POINT option on the SET statement. This gives a basic idea of how file composition can be exploited to improve efficiency.
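The key transformation just described can be sketched in a few lines. The sketch is written in Python rather than SAS, purely to show the arithmetic. The function names `direct_index` and `fetch`, the 1-based slot numbering (chosen to mirror observation numbers as used with POINT), and the assumption that the lookup table holds one record for every possible key are illustrative assumptions, not part of the paper.

```python
def direct_index(key: int) -> int:
    """Map a key in 100000..109990 (last digit always 0) to a 1-based slot.

    Digits 3-5 of the key form a number in 0..999; adding 1 gives a
    1-based position of the kind an observation pointer expects.
    """
    if not (100000 <= key <= 109990) or key % 10 != 0:
        raise ValueError(f"key {key} does not fit the assumed pattern")
    return (key // 10) % 1000 + 1


# Hypothetical lookup table laid out so that slot i holds the record for
# the i-th possible key; no searching is needed to reach a record.
lookup = [f"record for {100000 + 10 * i}" for i in range(1000)]


def fetch(key: int) -> str:
    """Read the lookup record for key with a single direct access."""
    return lookup[direct_index(key) - 1]  # Python lists are 0-based
```

Under these assumptions every lookup costs exactly one read of the lookup table, which is the best case that any search technique can only approach.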

Cases with large files are, by far, the most interesting to examine because those are the situations where efficiency is the most critical. One fact that becomes clear from looking at the figures is that data bases employing large lookup tables (of size 10,000 or more) cannot use formats for table lookups. Macro variables are just as restrictive. Except when MERGE BY in a data step is appropriate, the only remaining techniques available are binary search and enhanced binary searches. Binary search is an excellent technique by itself, and it is ideal when the main file is small and the lookup file is large. But with large main files as well as large lookup files, it becomes obvious from considering an example that enhancements to the binary search can significantly improve efficiency. A binary search routine is enhanced when the range of observations considered is limited. This paper presents one general example of how to do this. When one has an understanding of the program, modifications can be made to achieve maximum efficiency based on the structure of the key to the lookup file.


Consider the following example: a main file has 100,000 records and the lookup file has 32,000 records. A binary search requires an average of INT(LOG2(32000)) = 14 reads of the lookup file per record in the main file. That means the lookup table will be read with a SET statement about (14 sets per main file record) * (100,000 main file records) = 1.4 million times. If we are able to make a directory of size 1000, then, on average, we can reduce the number of SET statements addressing the lookup table by roughly LOG2(1000), which is about 10 lookup file SET statements per main file observation. This reduces the number of sets of the lookup file to about 400,000, down 1,000,000. With a savings of this magnitude, the preprocessing step required to establish a directory is clearly worthwhile. Examples of creating directories can be found in Reference 2.
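The arithmetic in this example can be checked with a short calculation. It is written in Python only as a calculator; the variable names are illustrative, and rounding LOG2(1000) to 10 follows the "about 10" approximation used in the text.

```python
import math

main_records = 100_000      # records in the main file
lookup_records = 32_000     # records in the lookup file
directory_entries = 1_000   # entries in the directory

# Average reads of the lookup file per main-file record, plain binary search.
reads_plain = int(math.log2(lookup_records))

# Reads saved per record by the directory; log2(1000) is about 9.97,
# which the text rounds to 10.
reads_saved = round(math.log2(directory_entries))

total_plain = reads_plain * main_records
total_with_directory = (reads_plain - reads_saved) * main_records

print(reads_plain, total_plain)              # 14 1400000
print(reads_saved, total_with_directory)     # 10 400000
print(total_plain - total_with_directory)    # 1000000
```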

REFERENCES

1. Henderson, Don, 'Table Lookup Techniques', SUGI '82

2. Howard, Neil and Pickle, Linda, 'Efficient Data Retrieval - Direct Access Using the Point Option', SUGI '84

The authors can be contacted at:

ORI, Inc., 122 C St., NW, Suite 250, Washington, D.C. 20001, (201) 737-2666


FIGURES

INTRODUCTION

Table lookup procedures are used for cross-referencing a lookup file based on the value of a variable in a main file.

Definitions:

main file - the main file being processed, reported on, etc.

lookup file - the file containing desired information for the main file. This file is not necessarily in the same sort order as the main file.

key - the variable (or variables) in common between the main file and the lookup file. Variables in the lookup file are looked up by this group of variables.

result of lookup - the variables ultimately desired from the lookup file.

SAMPLE DATA

MAIN FILE

NAME              ZIP     POLITICAL PARTY

ADAMS, JOHN       29845   R
COLUMBUS, CHRIS   12634   I
COOPER, PAULA     65729   R
DALTON, JAMES     37490   D
DEBBS, EUGENE     83726   S
LORAN, NANCY      13245   D
MARX, KARL        17329   C
PORTER, ALAN      88123   R
SOBER, TOM        14231   D
THORPE, MARTHA    61524   I


LOOKUP FILE

                                (COWS/CAPITA)   (ASSEMBLYMAN)
ZIP     COUNTY        POP       COWS            ASSEMBLY        PARTY

00010   MANHATTAN     2123654   0               WARNER          D
02845   WESTCHESTER   1534234   .13             DIXON           R
14623   MONROE        712983    1.03            CHARLES         R

KEY - ZIP

RESULT OF LOOKUP - any combination of other LOOKUP FILE variables

MERGE

This technique simply uses the SAS MERGE statement.

ADVANTAGES:
* Easy to code
* Result of lookup may have any number of variables, each of any length allowed by SAS

DISADVANTAGES:
* May have to sort main file, then merge, then re-sort main file
* Need to visit every record of the lookup file (but just once)
* Multiple lookups (presumably on different keys and perhaps on different lookup files) need multiple steps

CONCLUSION: Use if:

1) Main file and lookup file are already sorted by the same keys and lookup file is not too large with respect to the main file.

or 2) Sort, merge, and sort will not expend too much CPU time (to be determined by the user).


CODE USING MERGE TO LOOKUP pop FOR EACH RECORD OF MAIN

1) proc sort data=main;
      by zip;
   run;

2) data find;
      merge main(in=inmain) lookup(keep=zip pop);
      by zip;
      if inmain;
   run;

3) proc sort data=find;
      by name;
   run;

NOTE: If multiple lookups are needed (with different lookup files), steps 1 and 2 would have to be repeated.
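The three steps above (sort, merge, re-sort) can be sketched outside SAS to make the cost structure visible. The following Python sketch is an illustration only: the function name `merge_lookup` and the sample records are made up, and the lookup list is assumed to be already sorted by zip with unique zips.

```python
def merge_lookup(main, lookup):
    """Sort-merge lookup: attach pop to each main record, keeping all of main.

    main   : list of (name, zip) tuples, in any order
    lookup : list of (zip, pop) tuples, already sorted by zip, unique zips
    """
    by_zip = sorted(main, key=lambda rec: rec[1])       # step 1: sort main
    found, j = [], 0
    for name, zip_code in by_zip:                       # step 2: merge
        # Advance through lookup once; it is never rewound.
        while j < len(lookup) and lookup[j][0] < zip_code:
            j += 1
        matched = j < len(lookup) and lookup[j][0] == zip_code
        found.append((name, zip_code, lookup[j][1] if matched else None))
    return sorted(found, key=lambda rec: rec[0])        # step 3: re-sort


# Made-up records, used only to exercise the sketch.
main = [("CHARLIE", "30000"), ("ALPHA", "20000"), ("BRAVO", "99999")]
lookup = [("10000", 5), ("20000", 7), ("30000", 9)]
result = merge_lookup(main, lookup)
print(result)
# [('ALPHA', '20000', 7), ('BRAVO', '99999', None), ('CHARLIE', '30000', 9)]
```

The two sorts of the main file dominate the cost when the main file is large, which is why the conclusion above hinges on whether the files are already in key order.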

PROC FORMAT

* PROC FORMAT is efficiently used as a table lookup procedure
* Need a tool to convert a SAS data set into a format table

ADVANTAGES:
* Once format table is created, lookup is extremely fast
* Many lookups on different keys can be done in one data step
* Doesn't require input data set to be sorted

DISADVANTAGES:
* The cost of creating the format table grows exponentially with the size of the lookup file
* The result of lookup is limited to 40 characters

CONCLUSION: Use if:

1) Lookup file is not too large (less than 10,000 records)
and 2) Merge is not appropriate (for reasons stated above)
and 3) Result of lookup is no more than 40 characters


MACRO WHICH CREATES FORMAT TABLE FROM SAS DATA SET

%macro makefmt(
   data=,     /* SAS data set */
   key=,      /* key variable for format; concatenated if mult. keys */
   result=,   /* result of lookup; concatenated if mult. variables */
   format=,   /* name of the format table; prefix with $ if char. format */
   );

options dquote;

data _null_;
   length &result $40;   /* truncate to 40 chars if more */
   do until(lastrec);
      set &data end=lastrec;
      _cntr+1;
      _ccntr=left(put(_cntr,5.));

      %if %substr(&format,1,1) = $ %then %do;   /* character format */
         call symput('_vl' || _ccntr, '"' || &key || '"');
      %end;
      %else %do;   /* numeric format */
         call symput('_vl' || _ccntr, &key);
      %end;

      call symput('_lb' || _ccntr, '"' || &result || '"');
   end;
   call symput('entries', _ccntr);
   stop;
run;

proc format;
   value &format
   %do i=1 %to &entries;
      &&_vl&i = &&_lb&i
   %end;
   ;   /* ends the VALUE statement */
run;

%mend makefmt;


ALTERNATIVE CODE TO CREATE FORMAT TABLE FROM SAS DATA SET

options dquote;

data _null_;
   file temp;
   put 'proc format;' / 'value zipfmt';
   do until(lastrec);
      set lookup end=lastrec;
      put zip 5. '=' '"' pop 9. '"';
   end;
   put ';';
   stop;
run;

%include temp;

Note: while this code is easier to follow than the macro approach presented above, its drawback is that it requires writing code to an external data set.

CODE USING PROC FORMAT TO LOOKUP pop FOR EACH RECORD OF MAIN

%makefmt(data=lookup, key=zip, result=pop, format=zipfmt)

data find;
   set main;
   pop=put(zip,zipfmt.);
run;
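The shape of this technique (build the table once, then probe it once per main record) can be illustrated with an in-memory mapping. The Python sketch below is an analogy only: a dictionary is not how SAS stores formats, the records are made up, and an unmatched key yields None here rather than whatever a SAS format would return.

```python
# Made-up records, used only to exercise the sketch.
lookup = [("10000", 5), ("20000", 7), ("30000", 9)]     # (zip, pop), any order
main = [("CHARLIE", "30000"), ("ALPHA", "20000"), ("BRAVO", "99999")]

# One pass over the lookup file builds the table (the costly step).
zipfmt = {zip_code: pop for zip_code, pop in lookup}

# Each lookup is then a single in-memory probe; main needs no sorting
# and keeps its original order.
find = [(name, zip_code, zipfmt.get(zip_code)) for name, zip_code in main]
print(find)
# [('CHARLIE', '30000', 9), ('ALPHA', '20000', 7), ('BRAVO', '99999', None)]
```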


ENHANCEMENT TO PROC FORMAT LOOKUP

If the result of lookup cannot conveniently fit in a format label (e.g. greater than 40 characters), then create a format table for pointers to the lookup file.

data pointers(keep=zip n);
   set lookup;
   n = _n_;   /* n is record pointer to lookup file */
run;

%makefmt(data=pointers, key=zip, result=n, format=lookfmt)

data find;
   set main;
   pointer = input(put(zip,lookfmt.),5.);
   set lookup point=pointer;   /* this is the lookup */

   etc. (more code if needed)

MACRO VARIABLES

* Very similar advantages and disadvantages as PROC FORMAT
* Need a tool to create macro variables and retrieve them at will

ADVANTAGES:
* Doesn't require sorted lookup file
* Result of lookup may be up to 200 chars
* Once macro variables are created, lookup is extremely fast
* Lookup table is easily altered

DISADVANTAGES:
* Large lookup tables will require vast amounts of core
* Adds macro variables to the symbol table, which slows down processing of other macro vars
* Macro variables can't be stored for other job steps
* Not modular; care must be taken when naming other macro variables
* Forces the value of the key to be 7 characters or less


CONCLUSION: Use if:

* Lookup file is not too large (less than 2000 records)
and * Merge is not appropriate
and * Altering of table is needed between steps (i.e. not able to do with PROC FORMAT)

MACRO WHICH CREATES MACRO VARIABLES FROM SAS DATA SET

%macro mactbl(
   data=,     /* SAS data set */
   key=,      /* key for lookup */
   result=,   /* result of lookup */
   prefix=,   /* letter identifying this lookup table from other macro vars */
   );

options dquote;

data _null_;
   do until(lastrec);
      set &data end=lastrec;
      call symput("&prefix" || &key, &result);
   end;
run;

%mend mactbl;

CODE USING MACRO VARIABLES TO LOOKUP pop FOR EACH RECORD OF MAIN

%mactbl(data=lookup, key=zip, result=pop, prefix=p)

data find;
   set main;
   pop=symget('p' || zip);
run;


BINARY SEARCH

DEFN. A binary search successively divides the range of records to be searched in half until the sought record is found.

ADVANTAGES:
* Performs well regardless of the size of the lookup file or main file
* The result of the lookup may have any number of variables, each of any length allowed by SAS

DISADVANTAGES:
* The lookup file must be sorted by its key

CONCLUSION: Use if:

1) Lookup file is large, i.e., PROC FORMAT is not appropriate

and/or 2) Main file and lookup file have different sort orders.

CONCEPT OF BINARY SEARCH

EXAMPLE: SEARCH FOR SPOCK

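Each column shows the range of records still under consideration; ** marks the record read by that SET statement.

SET 1       SET 2       SET 3       SET 4

APPLE
BABOON
CHICAGO
DOOR
HOUSE
JEEP
LOUSE
MAIL **
NAPLES      NAPLES
PIZZA       PIZZA
QUILT       QUILT
ROOM        ROOM **
SPOCK       SPOCK       SPOCK       SPOCK **
TIRE        TIRE        TIRE **
WATER       WATER       WATER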

Note: SPOCK found in 4 SET statements.

In general, for a lookup file of size N:

* The maximum number of SET statements is INT(LOG2(N)) + 1

* The average number of SET statements is INT(LOG2(N))
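The SPOCK example and the stated maximum can be reproduced with a short sketch that counts reads. It is written in Python rather than SAS, purely to illustrate the search logic; the function name `binary_search` and the use of 1-based positions (to mirror observation numbers) are assumptions made for the illustration.

```python
import math


def binary_search(table, target):
    """Return (position, reads); position is None when target is absent.

    table must be sorted. Positions are 1-based, as observation numbers are.
    """
    first, last, reads = 1, len(table), 0
    while first <= last:
        mid = (first + last) // 2
        reads += 1                      # one read of the lookup table
        if table[mid - 1] == target:
            return mid, reads
        if target < table[mid - 1]:
            last = mid - 1              # continue in the lower half
        else:
            first = mid + 1             # continue in the upper half
    return None, reads


words = ["APPLE", "BABOON", "CHICAGO", "DOOR", "HOUSE", "JEEP", "LOUSE", "MAIL",
         "NAPLES", "PIZZA", "QUILT", "ROOM", "SPOCK", "TIRE", "WATER"]

print(binary_search(words, "SPOCK"))            # (13, 4)
worst = max(binary_search(words, w)[1] for w in words)
print(worst, int(math.log2(len(words))) + 1)    # 4 4
```

The reads for SPOCK land on MAIL, ROOM, TIRE, and SPOCK in turn, matching the four SET statements in the figure.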


BINARY SEARCH MACRO

%macro bsearch(
   table=,        /* name of lookup file */
   key=,          /* key variable - concatenated if multiple variable key */
   getvars=,      /* result of lookup */
   first=,last=   /* used with a directory */
   );

_found = 0;   /* variable indicates search successful */
_stop = 0;

/* restrict range of obs searched if required */
%if &first eq %then %do; _first = 1; %end;
%else %do; _first = &first; %end;

%if &last eq %then %do; _last = _total; %end;
%else %do; _last = &last; %end;

/* perform binary search using set with point */
do until(_stop);
   _mid = int((_first + _last) / 2);
   set &table (rename=(&key=_target) keep=&getvars &key) point=_mid nobs=_total;
   if &key = _target then do;   /* found desired record */
      _stop = 1;
      _found = 1;
   end;   /* found desired record */
   else do;   /* continue searching */
      if &key < _target then do;   /* search lower sublist */
         _last = _mid - 1;
      end;   /* search lower sublist */
      else do;   /* search upper sublist */
         _first = _mid + 1;
      end;   /* search upper sublist */

      if _first > _last then   /* record not on file */
         _stop = 1;   /* record not on file */
   end;   /* continue searching */
end;

%mend bsearch;

CODE USING BINARY SEARCH TO LOOKUP pop FOR EACH RECORD IN MAIN

data find;
   set main;
   %bsearch(table=lookup, key=zip, getvars=pop)
run;

ENHANCEMENT TO BINARY SEARCH:

DIRECTORIES

The concept is to limit the range of observations to be looked at by binary search.

ADVANTAGES:
* Will require on average INT(LOG2(n)) fewer SET statements for each lookup than the binary search without a directory, where n = number of obs in the directory

DISADVANTAGES:
* The savings stated in ADVANTAGES may not justify the cost of the preprocessing step which is required to create the directory. The preprocessing step requires passing through the lookup file once.


CODE TO CREATE A DIRECTORY TO LOOKUP

PROGRAM DESCRIPTION:

* Transform the key, ZIP, into a partial key. In this case, the partial key will be the first 3 characters of ZIP. Then output one record for each partial key to data set POINTERS. Each record of POINTERS contains the number of the first observation of LOOKUP with that partial key, concatenated with the number of the last observation of LOOKUP with that partial key.

* Create a format table from POINTERS using %MAKEFMT. Notice that %MAKEFMT becomes a software tool which can be built upon to create more advanced lookups.

* For each record of main, obtain FIRST and LAST from the format table using the PUT function and pass these variables to %BSEARCH to restrict the range of observations searched by BSEARCH on each pass.

CODE TO CREATE A DIRECTORY TO LOOKUP

data pointers(keep=savepart f_n_l rename=(savepart=part_key));
   length part_key savepart $3;
   retain first 1 savepart;
   set lookup end=lastrec;
   part_key = substr(put(zip,5.),1,3);
   if _n_ = 1 then savepart = part_key;
   else if part_key ne savepart then do;   /* create new obs in directory */
      f_n_l = put(first,5.) || ',' || put(_n_-1,5.);
      output pointers;
      first = _n_;
      savepart = part_key;
   end;   /* create new obs in directory */
   if lastrec then do;   /* put out last obs in directory */
      f_n_l = put(first,5.) || ',' || put(_n_,5.);
      output pointers;
   end;   /* put out last obs in directory */
run;

/* part_key is character, so the format name carries the $ prefix */
%makefmt(data=pointers, key=part_key, result=f_n_l, format=$direct)

data find;
   set main;
   f_n_l = put(substr(put(zip,5.),1,3),$direct.);
   first = input(scan(f_n_l,1,','),5.);
   last = input(scan(f_n_l,2,','),5.);
   %bsearch(table=lookup, key=zip, getvars=pop, first=first, last=last)
run;
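The effect of the directory on the number of reads can be seen in a small sketch. It is written in Python rather than SAS and is an illustration under stated assumptions: the function names `build_directory` and `search_range`, the use of a dictionary in place of the format table, and the 32 made-up zip keys are not from the paper.

```python
def build_directory(keys, prefix_len=3):
    """Map each key prefix to the (first, last) 1-based positions holding it.

    keys must be sorted strings; one pass over the lookup keys suffices.
    """
    directory = {}
    for pos, key in enumerate(keys, start=1):
        part = key[:prefix_len]
        first, _ = directory.get(part, (pos, pos))
        directory[part] = (first, pos)
    return directory


def search_range(keys, target, first, last):
    """Binary search restricted to positions first..last; returns (pos, reads)."""
    reads = 0
    while first <= last:
        mid = (first + last) // 2
        reads += 1                      # one read of the lookup table
        if keys[mid - 1] == target:
            return mid, reads
        if target < keys[mid - 1]:
            last = mid - 1
        else:
            first = mid + 1
    return None, reads


# Made-up lookup keys: 32 five-character zips, 8 per three-character prefix.
keys = [f"{p}{s:02d}" for p in ("100", "146", "298", "881") for s in range(8)]
directory = build_directory(keys)

target = "29805"
plain = search_range(keys, target, 1, len(keys))
first, last = directory.get(target[:3], (1, 0))   # empty range if prefix unknown
narrowed = search_range(keys, target, first, last)
print(plain, narrowed)                            # (22, 4) (22, 2)
```

With a directory of 4 entries the search for this key drops from 4 reads to 2, a saving of LOG2(4) = 2 reads, in line with the advantage stated for directories.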
