
Efficient Use of Numeric and Character Data Types

Una Smith, Duke University

Abstract

A basic understanding of data types in programming languages is necessary for the efficient construction of both data sets and programs in the SAS¹ System. This paper addresses the efficient use of numeric and character data types at two levels of project development: initial planning and ongoing maintenance.

For researchers and planners, some general concepts of good data design are outlined, with an emphasis on data types; these concepts should help you to design data sets that are both clearer and easier to work with. SAS users who have not agonized over the choice of numeric or character variables may discover that there are good reasons for exercising care in this matter. Those who have may find a review of the major issues helpful as they learn how to use the powerful SAS data step more efficiently.

For the data base manager, some practical strategies for working around sub-optimal types are described. The significance of numeric variable lengths for storage requirements on various computer platforms is covered, and the use and misuse of boolean, concatenated, date, time and datetime values is discussed. Important points are illustrated with complete SAS data step programs.

¹ SAS is a registered trademark of SAS Institute, Inc., Cary, NC, USA. The word SAS, when used as a noun in this paper, refers to the SAS programming language.

1 Introduction

Experienced users of the SAS System typically have four goals in mind when thinking about good data design:

• To reduce CPU usage or improve turn-around times.
• To reduce storage requirements.
• To enhance program clarity, elegance and convenience.
• To enhance end-user quality control and data safety.

Progress toward the first two of these goals can be precisely stated in terms of the computer resources they use. Frequently, there is a tradeoff between the two; in order to write faster data steps, SAS programmers must often accept larger data sets. This usually occurs when character variables are replaced with numeric ones, or vice versa. The interaction may also be a positive one, with improvements on both counts; this tends to occur most often when the two remaining goals, convenience and control, are kept in mind.

These two goals are very subjective; experienced users gain an intuitive sense about which techniques are effective for the kinds of data they work with. There is generally a tradeoff between these goals as well; by definition, expert users are capable of writing perfectly correct programs that most inexperienced users would not understand.

"Efficiency" can mean tuning your programs for each of these goals, but there may be unexpected interactions between them. I make no attempt at providing the "right" answer, only suggestions for thoughtful decision-making; they may help you find the best solution for your own computing problems.

In particular, I show how some common usages of numeric and/or character types affect each strategy. When you return to your familiar data sets at the end of this conference, you may well see some of your favorite variables in a new light. Or maybe not. Either way, you will have a better feeling for the data; you will see more clearly what the data is, not just what it looks like. Also, you will be better able to choose between numeric and character variables to make your data sets and programs both more efficient and easier to understand.

2 Data Types

Data types are abstractions that nonetheless have very powerful, almost physical roles in computer function. A data type is a contract between your software and your computer, whereby a name or number stored in an electrical circuit somewhere inside the computer's memory can be retrieved again in a form that the software recognizes.

A data type is also a contract between your software and you, specifying exactly what you may do to data of that type. The functions and procedures built into a software package also use data types to define the types of input they require and output they create. Each function and procedure specifies exactly the data type that each value must have when used as an argument to it. Likewise, the result of a function has a predefined data type. Note that, by definition, a procedure may modify the values of its arguments, but does not give a separate result. Also note that a value may be either a constant, or the contents of a variable.

2.1 Operators

The Pascal language has several types of operators for numeric values; division is performed by both div, which gives integer results, and /, which gives real results. SAS has a rich variety of unique operators, as well as a variety of functions that substitute for special operators. For example, the SAS boolean operators and, or, and not operate on numeric values.

2.2 Functions

As was mentioned briefly at the start of this section, the argument or arguments to each SAS function must be of a particular type, either numeric or character, and the result of each function is also of a particular type. The SAS functions are grouped, more or less according to these data types, as follows:

1. Numeric functions, including
   (a) Arithmetic
   (b) Financial
   (c) Mathematical
   (d) Probability distributions
   (e) Quantiles
   (f) Random number generation
   (g) Sample statistics
   (h) Trigonometric and hyperbolic
2. Array functions
3. Character functions
4. Date and time functions
5. State and Zipcode functions
6. Special functions
7. System functions
8. Truncation

For example, the SAS substr function expects three arguments; the first argument must be a value of type character, such as 'SUGI 16'. The two remaining arguments must both be numeric values, such as 1 and 4. Thus the expression

    substr('SUGI 16', 1, 4)

returns the result 'SUGI', a character value. The numbers and types of arguments expected by each SAS function are explained in detail in all versions of the SAS manuals.

2.3 Values

Most computer programming languages have large sets of data types, and allow the construction of additional user-defined types. As a rule of thumb, the more specialized a language is, the fewer data types it supports. The REXX language has no data types; SAS has two. The Pascal language, in its simplest form, includes four types: char for single characters, integer for numbers such as 0, 1, 2 etc., boolean for the values True and False, and real for numbers such as 0.0, 1.33..., etc.

The SAS language includes only numeric and character data types. The type of a constant value is determined by the absence or presence of quotes around the value. The type of a variable value is determined by an attrib, length, retain or input statement, or by assignment from a function or operator.

2.4 Variables

Because SAS has only two data types, numeric and character, the choice of which to use for variables becomes an issue for careful consideration. (Relatively little thought is needed to select an appropriate data type when there are dozens of types to choose from!) SAS has several unusual conventions which superficially approximate the function of additional data types. These are the numeric SAS date, time, and datetime values and their associated SAS informats, formats and functions, and the bit mask convention along with the byte and rank functions. These conventions must be understood in detail if they are to be
used efficiently and safely. In particular, it is important to know how the number of bytes used to store a numeric variable affects the accuracy of values under these conventions.

2.5 Informats and Formats

The utility of both data types (but particularly the numeric type) is enhanced by the use of informats and formats, and the format procedure which enables you to create your own SAS formats. Note that user-defined informats are not allowed, for reasons that go beyond the scope of this paper. SAS informats and formats, respectively, affect interpretation of data values only on input and output; it may be helpful to think of them as filters or templates which allow data to pass through in only one direction. These and other conventions are discussed at greater length in the following sections.

3 Numeric Conventions

To represent a fractional or floating point value such as 0.333333... to the fullest precision available, a variable with a length of 8 bytes must be used. The choice of length for numeric variables depends in part on whether the values stored in that variable are of type integer, or fixed or floating point (both are of type real).

To create what passes for an integer in the SAS language, use the int, trunc, and round functions. To create a fixed point (or decimal) number, use the round function:

    1/3              → 0.333333...
    round(1/3, 0.01) → 0.330000... → 0.33
    round(1/3, 1)    → 0

3.1 Fixed Point

Generally, fixed point variables can be treated much like integers in the calculation of necessary lengths, if you pretend momentarily that there is no decimal point. For example, a variable whose values range between 0.0 and 9.9 by 0.1 increments is similar to a variable whose values are integers ranging between 0 and 99. For SAS version 6.04 and earlier, compare this range of 'integer' values to the values listed in the tables at the end of the section on the length statement in the SAS reference manuals. You must use a version of the manual appropriate for your computer. Also, the length you choose may be too small on another computer; this is true whenever you use less than 8 bytes to store any numeric variables in SAS. Use extreme caution: read the various discussions on this topic in the SAS documentation and talk to a professional SAS programmer before proceeding, or play it safe and use a length of 8 bytes for all variables whose values may have decimal parts.

The following is another quick and dirty, but definitive test of the least number of bytes needed to store all values of a variable correctly:

1. Use a procedure such as univariate to find the minimum and maximum values stored in the variable, if necessary.

2. Figure out how many bytes you probably need, using the technique described above, or use the least number that your SAS implementation will allow. On most machines, at least 3 bytes are needed to store any numeric variable, but on a few machines 2 bytes may be enough. If you don't want to bother with this step, use 3.

3. Run this program:

    %let least = 3;
    %let minimum = 0;
    %let maximum = 99.99;
    %let interval = 0.01;

    /* The loop is valid only inside a macro definition; the name trylen is arbitrary. */
    %macro trylen;
      %do bytes = &least %to 8;
        data temp;
          length i &bytes;
          do i = &minimum to &maximum by &interval;
            output;
          end;
        run;
        proc sort nodupkey;
          by i;
        run;
        %put Length of i was &bytes..;
      %end;
    %mend trylen;
    %trylen

4. Inspect the output log. If any observations were dropped during the SORT procedure, then the number of bytes used was too small.

Another way to store fixed point data using the fewest bytes possible is to adjust the units of measure, so that the data can be stored as integer values instead. Do this only if the data are still intelligible in the new units. For example, it may make sense to express tree trunk diameters in (integer) millimeters instead of the more usual tenths of centimeters (i.e., 19 mm instead of 1.9 cm). The effort of writing down and keying in the decimal point would be saved, and the values could be displayed in centimeters at any time, simply by defining a format using the picture
statement of the format procedure. Unfortunately, this strategy may not work if you use units that are not metric; how would you express 10.2 inches without the decimal place?

3.2 Dates, Times and Datetimes

SAS has a special convention for storing dates as numeric values representing the number of days since January 1, 1960. A similar convention is available for storing datetime values as the number of seconds since midnight, January 1, 1960. The SAS time value is simply the number of seconds since midnight. The SAS date convention is extremely useful for storing dates, but it has several drawbacks; the stored dates appear to be random integer values, and are not recognizable without the use of a SAS date format. Also, because these dates are actually stored as numeric values, the SAS compiler will not complain about inappropriate operations (such as multiplication) using date values.

Because they are easier to read, many SAS users prefer to use Julian date values. Unfortunately, there is a problem with this method as well: Julian dates are another example of concatenated values. A Julian date (e.g., 91051, the 51st day of '91) can be stored conveniently as a numeric value, but it is actually two values: year and day within year. Julian dates can be compared to one another safely, since they sort properly. However, they can not be used reliably to calculate intervals, nor can intervals reliably be added to or subtracted from Julian dates. Despite the drawbacks that result from not having a true data type for date values, these two formats are both powerful and convenient, when used together with the many informats, formats and functions that SAS provides to help you:

• Compare a date to other dates.
• Subtract one date from another to calculate an interval of time.
• Add or subtract integer values (days) from a date to calculate a new date.
• Using date functions, add or subtract months and years from a date.
• Using date informats and formats, read and write dates in virtually any style.
• Store dates in numeric variables as either Julian or SAS date values.
• Using the datejul and juldate functions, convert SAS date values to Julian values and from Julian values to SAS date values.
• Remember: operations other than comparisons should be performed only on SAS date values.

3.3 Boolean

Some programming languages include a true boolean type; SAS has a convention which approximates the function of a boolean type. In SAS, an expression testing that true, a numeric variable, does not contain the value 0 would be written if true. The SAS compiler expands the expression

    if true

into

    if true ne 0

not

    if true eq 1

before interpreting it. The numeric value 0 is equivalent to the boolean value False, as is a missing value; all other numeric values are equivalent to the boolean value True.

4 Numeric Concatenation

Character values are frequently concatenated. For example, variables such as city, state and zip may be combined to form a convenient new variable for use in a mailing list. There is a large number of functions for handling character values, including the very useful substr function for subsetting a concatenated character value. The statement

    exchange = substr(phone, 4, 3);

is certainly concise, but is it efficient? Because substr expects a character argument and returns a character result, but both Phone and Exchange have already been declared numeric, SAS must perform two implicit type conversions. This data step took 417.3 CPU seconds to complete, and without the length statement, the variable Exchange would have been of type character. When the statement was rewritten as

    exchange = input(substr(put(phone, 10.), 4, 3), 3.);

the data step ran 37% faster, finishing in 262.8 CPU seconds. Are these type conversions really necessary? Both the argument and the result values are numeric; it should be a simple matter to systematically convert one number to another using one or more arithmetic functions and numeric operators. With the statement rewritten once more to

    exchange = int(mod(phone, 1e7) / 1e4);

the data step ran 68.1% faster, using only 133.2 CPU seconds. All three versions of the data step used the same amount of memory.
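The two explicit forms of the Exchange calculation can be checked against each other on a single value. The following is a minimal sketch, not the timing program behind the CPU figures quoted above: the data set name phones, the variable names exchange_a and exchange_b, and the sample telephone number are all hypothetical, and phone is assumed to hold a 10-digit number with the exchange in digits 4 through 6.

```sas
data phones;
  length phone exchange_a exchange_b 8;
  phone = 9195551234;   /* hypothetical 10-digit telephone number */

  /* explicit conversions: numeric to character, subset, back to numeric */
  exchange_a = input(substr(put(phone, 10.), 4, 3), 3.);

  /* arithmetic only: keep the last 7 digits, then discard the last 4 */
  exchange_b = int(mod(phone, 1e7) / 1e4);

  /* for this value both methods give 555 */
  if exchange_a ne exchange_b then put 'Methods disagree for ' phone=;
  put phone= exchange_a= exchange_b=;
run;
```

The arithmetic form relies on every value having the assumed digit layout; a number with fewer digits, or with an extension appended, would need different divisors.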

The lesson in this example is clear: Avoid unnecessary type conversions, but when type conversions are necessary, make them explicit. Most SAS programmers know that type conversions are expensive, but only a fraction of them realize that type conversions that are not requested through the input or put functions cost even more.

5 Extra Credit

The best way to learn how numeric and character data types affect program execution is to practice using them. Included below are two example situations in which the same algorithm is implemented using either numeric or character types. CPU times from large test runs are provided; the task of calculating disk data storage requirements is left to you.

5.1 Bit Masks

SAS has several functions designed for manipulating the individual bits of a value (which can be either numeric or character, but is generally character). Before attempting to use bit masks to perform bit testing, read the SAS documentation carefully. If you use bit masks to store and retrieve data, you must understand the significance of the ASCII and EBCDIC standards for representing data; they use different bit patterns to represent the same value. If you transfer a data file between an ASCII computer such as a PC and an EBCDIC computer such as an IBM mainframe (in either direction), the bit patterns in the file will be changed so that the values are kept the same. If you store data as either an ASCII bit pattern or an EBCDIC bit pattern, you must take special precautions to preserve the bit pattern whenever moving the data to another computer platform. Also, beware that the bit mask convention does not work in the same way for the two SAS data types, and that the numeric value 1991 is stored using a bit pattern that is completely different from the pattern used to store the character value '1991'.

The use of bit masks is shown best with a sample SAS program. Imagine that you have collected data on the simple binomial responses of three subjects to a series of 8 tests, and that the results are conveniently recorded as a string of 1's and 0's in a computer file named testbits.dat:

    10100000
    01000100
    00000000

Here is a SAS/PC program, testbits.sas:

    filename bits 'testbits.dat';

    data example (drop=i);
      length number 3;
      infile bits;
      ** variables in order right to left ;
      input (x8 x7 x6 x5 x4 x3 x2 x1) ($1.);
      ** store data as bits ;
      number = 0;
      array bit[8] x1-x8;
      do i = 1 to 8;
        ** expensive exponential in loop ;
        if bit[i] = '1'
          then number + 2**(i - 1);
      end;
      char = byte(number);
      ** finally, inspect with bit masks ;
      if char = '.1...1..'b then x3andx7 = 1;
      if char = '00000000'b then none = 1;
      if char = '0.......'b then not8 = 1;
    run;

The value of the variable x3andx7 will be 1 for the second subject and missing for the other two. The variable none will equal 1 for the third subject only, while the variable not8 will have the value 1 for both the second and the third subject.

5.2 Check Digits

If you are responsible for the design of large or expensive projects in which data-entry personnel key in multiple data sets whose records are eventually matched using a unique record key, you should probably be using a check digit scheme to ensure the accuracy of those keys. Check digits are useful only if you are able to assign the unique key, including a check digit, before the data is collected.

With careful management, check digits can be used with numbers such as SSN's. However, in situations where the use of check digits is practical, it is often practical also to generate unique serial numbers using a simple iterative routine. For example, credit card companies and banks typically assign account numbers whose last digit is a check digit. Whenever you call a credit card company to enquire about your account, the agent's computer verifies that the last digit matches the rest, according to a fairly simple algorithm, before attempting to locate the account number in its data base. Thus the correctness of the number is checked at the point of data entry, where any error can be instantly repaired. Some mistakes will be missed, but the most common errors (transpositions and simple substitutions) are caught immediately and effortlessly.

The check digit algorithm is as follows:

1. Some definitions are in order: Note that the words digit, numeral, and number are not interchangeable. For example, the first digit of the number 23, counting from the left, is the numeral 2; the digit and the number are both odd, but the numeral is even.

2. Consider a unique key or serial number to be a fixed-length string of digits.

    234768891

3. Working from left to right, double each odd digit, subtracting 9 if the result is greater than 9. Note that this is the same as adding together the two digits of the result. For example, if the digit is 9, then twice 9 is 18, 18 minus 9 equals 9, which equals 1 plus 8:

    18 - 9 = 9 = 1 + 8

4. Sum the results of the odd digit calculations and the values of all the even digits:

    234768891 → (2 + 2) + 3 + (4 + 4) + 7 + (6 + 6 - 9) + 8 + (8 + 8 - 9) + 9 + (1 + 1) = 51

The check digit is the amount that must be added to this sum to reach the next multiple of 10 (here 9), so the complete number is

    2347688919

To verify a number whose last digit is a check digit, strip off the check digit and use the remaining string of digits to calculate a new check digit. Compare the two digits; if they do not match, the number has been miskeyed. Use some actual MasterCard or Visa credit card numbers to verify the correctness of your SAS program, and include some 'mistakes' to see what sorts of errors the algorithm does not catch. (And be sure not to leave the credit card numbers lying around when you are through.)

Several versions of a SAS program implementing this algorithm are given below. The data _null_ step reads in a data set containing serial, a 9-digit character value. The program assumes that all values are right-adjusted and padded with 0's on the left. A check digit is calculated by accumulating the value of each digit of serial in the numeric variable check. Note that use of a numeric variable is required by this algorithm, because of the addition step.

The accumulation is done in two parts, first for the odd digits, then for the even ones. The odd digits (working from left to right) are transformed via a user-defined character format. Remember that formats work only with the put function and put statement, that the type of the format determines the expected type of the argument to the put function, and that the put function returns only character values. An input function and a numeric informat are needed to convert the transformed character value to a numeric value.

Once the calculation of check is complete, the value is converted back to a character value and stored in Valid. (It is probably a good idea to declare the check digit to be of the same type as the value it checks.) To modify this program for your own use, you need only change the %LET statements.

    %let library = sugi;
    %let dataset = nocheck;
    %let idnum = serial;
    %let idlen = 9;
    %let idchk = valid;   /* result variable; the name Valid is taken from the text */

    data _null_;
      set &library..&dataset;
      check = 0;
      do i=1 to &idlen by 2; ** odd digits ;
        check + input(translate(
                      substr(&idnum, i, 1),
                      '0246813579',
                      '0123456789'), 1.);
      end;
      do i=2 to &idlen by 2; ** even digits ;
        check +
          input(substr(&idnum, i, 1), 1.);
      end;
      check = mod(10 - mod(check, 10), 10);
      &idchk = put(check, 1.);
    run;

The data _null_ step took 254.0 CPU seconds to process 100,000 observations. The translate call in the first do loop can be replaced by the format procedure and the put function, like this:

    proc format;
      value $odd
        '0' = '0'
        '1' = '2'
        '2' = '4'
        '3' = '6'
        '4' = '8'
        '5' = '1'
        '6' = '3'
        '7' = '5'
        '8' = '7'
        '9' = '9';
    run;

    check + input(put(substr(&idnum, i, 1),
                  $odd.), 1.);

instead of

    check + input(translate(substr(&idnum, i, 1),
                  '0246813579',
                  '0123456789'), 1.);

The modified data step ran in 102.8 seconds, or 14.5% of the original time. Now, since the addition step can not be avoided, why not convert each digit to a numeric value at the start, before any calculations? A new numeric variable, Digit, is needed:

    ** convert each digit to numeric value ;

    data _null_;
      set &library..&dataset;
      check = 0;
      digit = 0;
      do i=1 to &idlen by 2; ** odd digits ;
        digit = 2 *
          input(substr(&idnum, i, 1), 1.);
        if digit > 9
          then digit = digit - 9;
        check + digit;
      end;
      do i=2 to &idlen by 2; ** even digits ;
        check +
          input(substr(&idnum, i, 1), 1.);
      end;
      check = mod(10 - mod(check, 10), 10);
      &idchk = put(check, 1.);
    run;

This version took 71.9 CPU seconds (10.1% as long as the original) to complete. Why not convert serial to a numeric variable at the start? The algorithm might then be written:

    ** convert serial to numeric value ;

    data _null_;
      set &library..&dataset;
      check = 0;
      digit = 0;
      number = input(&idnum, &idlen..);
      i = &idlen;
      do while(number > 0);
        digit = mod(number, 10);
        number = int(number / 10);
        if mod(i, 2) = 1 then do;
          digit = 2 * digit;
          if digit > 9
            then digit = digit - 9;
        end;
        i = i - 1;
        check + digit;
      end;
      check = mod(10 - mod(check, 10), 10);
      &idchk = put(check, 1.);
    run;

This version ran in 75.8 CPU seconds, only 10.7% as long as the first version using translate, and only 73.8% as long as the version using put(..., $odd.). Had serial been a numeric variable, no doubt the program would have run somewhat faster than this last example, since two type conversions per observation would be eliminated. However, the savings probably would not be very substantial, since the numeric version performs, in effect, ten extra divisions and multiplications for every observation.

6 Points of View

Experienced SAS programmers tend to hold one of two extreme points of view on the best method of assigning numeric or character data types to variables.

6.1 Numeric

    Declare variables numeric unless the values contain non-numeral characters. (Use numerals whenever possible.)

New users generally employ this strategy also, encouraged by the syntax of the input statement, which assumes all data will be numeric unless otherwise indicated. Many experienced users find typing the
required quotes around character constants in their code to be a nuisance. Some long-time users will recall the difficulties that could result from using keyboards with ' and ` keys, and monitors that displayed both as '; an accidental, invisible ` can be very hard to locate. Numeric constants do not require quotes, and the special conventions, informats, and formats (including picture formats) for numeric values make producing elaborate output extremely easy for skilled programmers. This strategy emphasizes programmer convenience.

6.2 Character

    Declare character variables unless the values are quantities.

This strategy is popular with managers of large, expensive and/or sensitive projects, especially when fairly inexperienced users have

8 Additional Reading

1. Cooper, D. Condensed Pascal, W. W. Norton & Company, New York, NY, 1987. Pages 5-13.

2. SAS Language and Procedures: Usage, Version 6, First Edition, SAS Institute, Inc., Cary, NC, 1989. Chapters 6-7, Working with Numeric Variables and Working with Character Variables, and Chapter 13, Working with Dates in the SAS System. Also, in Part 5, see Understanding Your SAS Session, pp. 279-330.

3. SAS Language Guide for Personal Computers, Release 6.03 Edition, SAS Institute, Inc., Cary, NC, 1988. Chapter 3, Expressions, Chapter 4, Functions, Chapters 13-14 on SAS Informats and Formats, SAS Output and Missing Values, and discussion of the length statement on pp.
