
DESIGN CONSIDERATIONS FOR OPTIMIZING FILE PROCESSING IN AN INTEGRATED DATA MANAGEMENT AND ANALYSIS SYSTEM

Michael R. Wrona, Far West Laboratory for Educational Research and Development
Norman A. Constantine, Far West Laboratory for Educational Research and Development
Mark P. Branagh, Stanford University

Introduction

PROC DBF is a SAS (R) procedure for personal computers which converts an xBase file into a SAS dataset file. By convention, xBase files have the extension ".dbf" and are often referred to as DBF files. SAS dataset files on the personal computer have the extension ".ssd" and will be referred to herein as SSD files. PROC DBF takes as input a DBF file on disk and outputs an SSD file to disk. This process is represented symbolically in Figure 1.

[Figure 1: xBase file (dbf) => PROC DBF => SAS dataset (ssd)]

[Figure 2: The two ways of combining two DBF files. Panels: "PROC DBF first, then MERGE" and "xBase JOIN first, then PROC DBF".]

The xBase file structure is an industry standard for personal computer systems. As such, it has a large installed base. It is employed by, among others, the makers of dBase (R) and FoxPro (TM). xBase generically refers to what personal computer database users have previously called dBase compatible. Statisticians and data analysts frequently will be called upon to perform statistical analyses on the contents of xBase DBF files. The SAS system is ideal for such analyses. In the simple case where the object is to perform an analysis using a single DBF file, PROC DBF is employed to convert the DBF file to an SSD file and then the analysis is executed with the SAS system for personal computers. This paper explores the more complex situation where the objective is to analyze the contents of two or more related DBF files, asking the following question: At what point should PROC DBF be invoked to optimize file processing? Specifically, under what circumstances should one use the xBase JOIN command rather than the SAS MERGE facility to combine two or more DBF files? The two ways in which two DBF files can be combined are shown in Figure 2.

Efficiency

The answer to the question posed above must be predicated upon the notion of efficiency. When working with databases, the most efficient file processing strategy is generally the one which minimizes disk accesses. This is because reading data from a disk into memory or writing data to a disk from memory typically requires substantially more time than operations which can be performed entirely within the computer's memory. The process of reading data from a disk (or other storage medium) into the computer's memory is referred to as an input operation. The process of writing data that is in the computer's memory to a disk (or other storage medium) is called an output operation. Input operations and output operations are similar processes which require, for the most part, comparable amounts of time. Thus, regardless of which direction the data flows, these operations are referred to as input/output operations, or I/O operations. The most efficient (optimized) system is the one which minimizes I/O operations. The goal of this paper is to identify the file processing technique with the lowest theoretical I/O count.

Counting I/O Operations

The number of I/Os required to read or write the contents of a database is of course dependent on the size of the database. To simplify the discussion which follows, we will assume that all DBF files whose contents will be used in an analysis are the same size. This allows us to consider that the act of reading or writing the contents of a database requires one I/O operation. The actual number of I/Os is merely a constant times the model formula. Because we are interested in the relative efficiency of competing file processing methods, this approach is appropriate.

How many I/Os are required to convert a DBF file into an SSD file using PROC DBF? PROC DBF takes as input one disk file and outputs to disk a file in a different format. Thus, we shall say PROC DBF requires two I/Os. (Because the input file and the output file have different file structures and will require different amounts of disk space, this is not technically precise, but it is adequate for our purposes.) The question of how many I/Os are required for the xBase JOIN command is a bit trickier.

The JOIN command takes as input two DBF files, merges them into one file, and outputs the resulting file to disk. Clearly each file input counts as one I/O operation. Since there are two input files, there are a total of two I/Os for the input operations associated with the JOIN command. The size of the file output, however, is contingent upon the way the files are joined.

The smallest output file would have zero records. The number of I/Os for this case theoretically would be zero, although in actuality the creation of the database on disk would require some I/Os. The largest output file would result when each record in one input file was joined with every record in the other input file. If each input file had M records, this type of join would result in an output file of M^2 records. These types of joins should rarely occur and are so unusual that they will not be discussed further.

The most common join we've encountered is a 1-to-1 matching of records between two databases. The theoretical maximum size of the resultant output file will be the sum of the two input files. In practice, however, the ever-efficient analyst will retain only the necessary variables from each database. Our experience shows that the average database resulting from a 1-to-1 join is roughly equivalent in size to either of the input databases. Thus, the output from a 1-to-1 join counts as one I/O operation.

By applying this logic, the expected size of the resultant database in an M-to-1 join of two databases will be M times the size of either input database. In the case of a 2-to-1 join between two databases, each file input would be counted as one I/O operation and the file output would be counted as two I/O operations. The 2-to-1 join is contrasted with the 1-to-1 join in Figure 3.

[Figure 3: Join type 2 to 1 (4 I/Os): 1 input, 1 input, 2 outputs. Join type 1 to 1 (3 I/Os): 1 input, 1 input, 1 output.]

[Figure 4: Join type 1 to 1; xBase JOIN first, then PROC DBF. JOIN: 3 I/Os; PROC DBF: 2 I/Os. 5 Input/Output Operations Total.]
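The counting rules described above can be written down as a small model. The following Python sketch is illustrative only: the function names, and the convention of measuring file size in units of one input DBF file, are assumptions introduced here and do not come from the original paper.

```python
# Illustrative sketch of the I/O counting model (names are hypothetical).
# Sizes are measured in units of one input DBF file, so reading or
# writing a file of relative size s counts as s I/O operations.

def proc_dbf_ios(size=1):
    """PROC DBF: one read of the DBF file plus one write of the SSD file."""
    return size + size


def combine_ios(input_sizes, output_size):
    """JOIN or MERGE: read every input once, then write the single output."""
    return sum(input_sizes) + output_size


if __name__ == "__main__":
    print("PROC DBF:", proc_dbf_ios())              # 2 I/Os
    print("1-to-1 join:", combine_ios([1, 1], 1))   # 3 I/Os
    print("2-to-1 join:", combine_ios([1, 1], 2))   # 4 I/Os
```

Treating the output size as a parameter mirrors the text: the inputs always cost one I/O each, and only the output cost depends on the join type.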

Working with 2 Input DBF Files

Figures 4 and 5 compare the number of I/Os required for doing 1-to-1 joins with two input DBF files. Figures 6 and 7 show the number of I/Os required for doing 2-to-1 joins with two input DBF files. In each case, we look first at the efficiency of doing the xBase JOIN prior to running PROC DBF, and then at the efficiency of running PROC DBF twice followed by a SAS MERGE.

In Figure 4, three I/Os are required for the xBase JOIN operation (two inputs and one output) and two I/Os are required for the PROC DBF (one input and one output). In Figure 5, each execution of PROC DBF requires two I/Os (one input and one output) and the SAS MERGE requires three I/Os (two inputs and one output). Thus, for a 1-to-1 join of two DBF files, it is more efficient to use the xBase JOIN than the SAS MERGE.

In Figure 6, four I/Os are required for the xBase JOIN (one for each file input and two for the output file with 2N records created by the 2-to-1 join). Four I/Os are required for the PROC DBF (one input of 2N records and one output of 2N records). In Figure 7, two I/Os are required for each execution of PROC DBF (one input and one output) and four I/Os are required for the SAS MERGE (two inputs each comprised of N records and one output of 2N records). Thus, for a 2-to-1 join with two input files, it does not matter whether the files are JOINed or MERGEd. Table 1 summarizes the information in Figures 4 through 7.

[Figure 5: Join type 1 to 1; PROC DBF first, then MERGE. PROC DBF: 2 I/Os each; MERGE: 3 I/Os. 7 Input/Output Operations Total.]

[Figure 6: Join type 2 to 1; xBase JOIN first, then PROC DBF. JOIN: 4 I/Os; PROC DBF: 4 I/Os. 8 Input/Output Operations Total.]

Summary: two dbf files

JOIN TYPE    PROC DBF done first    xBase JOIN done first
1 to 1       7                      5
2 to 1       8                      8

Table 1
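The totals in Table 1 can be reproduced from the counting rules. The Python sketch below is a hedged illustration under the paper's simplifying assumptions (equal-sized inputs, and an M-to-1 join producing output M times the size of one input); the helper names are hypothetical.

```python
# Hypothetical helpers reproducing the two-file totals of Table 1.
# m is the join type (m-to-1); file sizes are in units of one input file.

def join_first_total(m):
    """xBase JOIN of two DBF files, then one PROC DBF on the joined file."""
    join = 1 + 1 + m          # two inputs read, output of relative size m written
    proc_dbf = m + m          # joined DBF read, SSD of the same relative size written
    return join + proc_dbf


def proc_dbf_first_total(m):
    """PROC DBF on each DBF file, then one SAS MERGE of the two SSD files."""
    proc_dbf = 2 * (1 + 1)    # two conversions, each one read and one write
    merge = 1 + 1 + m         # two inputs read, output of relative size m written
    return proc_dbf + merge


table_1 = {f"{m} to 1": (proc_dbf_first_total(m), join_first_total(m)) for m in (1, 2)}
for join_type, (proc_first, join_first) in table_1.items():
    print(f"{join_type}: PROC DBF first = {proc_first}, xBase JOIN first = {join_first}")
```

In this sketch the JOIN-first strategy pays for the join type twice (once when the JOIN writes its output, and again when PROC DBF reads and rewrites it), which is why its advantage at 1 to 1 disappears at 2 to 1.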

[Figure 7: Join type 2 to 1; PROC DBF first, then MERGE. PROC DBF: 2 I/Os each; MERGE: 4 I/Os. 8 Input/Output Operations Total.]

Working with 3 Input DBF Files

Figure 8 shows the two models for converting three input DBF files into a single output SSD file. Recall that the xBase JOIN is a binary operation. It can only join two databases at a time. The SAS MERGE facility, however, can accommodate several datasets.

[Figure 8: Two models for converting three DBF files into one SSD file. Panels: "PROC DBF first, then MERGE" and "xBase JOIN first, then PROC DBF".]
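Because JOIN is binary, combining k files takes k - 1 successive joins, each of which writes an intermediate file that the next join must read back, whereas a single MERGE reads all k inputs once. A minimal Python sketch of that difference follows; the function names and the size convention (units of one input file) are assumptions made for illustration, and only the combining step is counted, not PROC DBF.

```python
# Illustrative only: I/O counts for the combining step alone (no PROC DBF).

def chained_join_ios(output_sizes):
    """Successive binary JOINs. output_sizes[i] is the relative size of the
    file written by join i; every newly joined input has relative size 1."""
    total = 0
    left = 1                         # size of the running (left-hand) input
    for out in output_sizes:
        total += left + 1 + out      # read both inputs, write the result
        left = out                   # the result feeds the next join
    return total


def single_merge_ios(n_inputs, output_size):
    """One MERGE step: read n unit-size inputs, write one output."""
    return n_inputs + output_size


# Three files, 1-to-1-to-1: two joins versus one merge.
print(chained_join_ios([1, 1]), single_merge_ios(3, 1))
```

Under these assumptions, the extra cost of chaining comes entirely from writing and re-reading the intermediate file.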

Figures 9 and 10 compare the number of I/Os required for doing 1-to-1-to-1 joins with three input DBF files. Figures 11 and 12 show the number of I/Os required for doing 2-to-1-to-1 joins with three input DBF files. In each case we look first at the efficiency of doing the xBase JOIN twice prior to running PROC DBF, and then at the efficiency of running PROC DBF three times followed by a SAS MERGE.

[Figure 9: Join type 1 to 1 to 1; xBase JOIN first, then PROC DBF. 8 Input/Output Operations Total.]

[Figure 10: Join type 1 to 1 to 1; PROC DBF first, then MERGE. 10 Input/Output Operations Total.]

[Figure 11: Join type 2 to 1 to 1; xBase JOIN first, then PROC DBF. 11 Input/Output Operations Total.]

[Figure 12: Join type 2 to 1 to 1; PROC DBF first, then MERGE. 11 Input/Output Operations Total.]
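The three-file comparisons in Figures 9 through 12 follow the same counting rules as the two-file case. As a rough Python sketch (the names are hypothetical, sizes are in units of one input file, and the first JOIN is taken to be 1-to-1 in both join types, as the paper describes for Figure 11):

```python
# Hypothetical sketch: total I/Os for three input DBF files.
# final_size is the relative size of the combined file
# (1 for 1-to-1-to-1, 2 for 2-to-1-to-1).

def join_first_total(final_size):
    """Two successive xBase JOINs, then one PROC DBF on the final DBF file."""
    first_join = 1 + 1 + 1               # 1-to-1 join of two files
    second_join = 1 + 1 + final_size     # join the result with the third file
    proc_dbf = final_size + final_size   # convert the joined DBF to SSD
    return first_join + second_join + proc_dbf


def proc_dbf_first_total(final_size):
    """Three PROC DBF runs, then one SAS MERGE of the three SSD files."""
    proc_dbf = 3 * (1 + 1)               # three conversions
    merge = 3 + final_size               # three inputs read, one output written
    return proc_dbf + merge


for label, size in (("1 to 1 to 1", 1), ("2 to 1 to 1", 2)):
    print(label, proc_dbf_first_total(size), join_first_total(size))
```

The first number printed on each line is the PROC DBF first total and the second is the xBase JOIN first total.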

In Figure 9, three I/Os are required for each xBase JOIN (two inputs and one output) and, since the databases are not increasing in size with each join, two I/Os are required by PROC DBF (one input and one output). In Figure 10, two I/Os are required by each execution of PROC DBF (one input and one output) and four I/Os are required for the SAS MERGE (three inputs and one output). Thus, for a 1-to-1-to-1 join of three DBF files, it is more efficient to use the xBase JOIN than the SAS MERGE.

Figure 11 shows the 2-to-1-to-1 xBase JOIN. In this diagram, the first (top) join is a 1-to-1 join requiring three I/Os (two inputs and one output) and the second (lower) join is a 2-to-1 join requiring four I/Os (two inputs of N records and one output of 2N records created by the 2-to-1 join). PROC DBF requires four I/Os (one input of 2N records and one output of 2N records). In Figure 12, two I/Os are required for each execution of PROC DBF and five I/Os are required for the SAS MERGE. Thus, a total of eleven I/Os are required whether PROC DBF is invoked first or the xBase JOINs are performed first. Table 2 summarizes the information in Figures 9 through 12.

Summary: three dbf files

JOIN TYPE      PROC DBF done first    xBase JOIN done first
1 to 1 to 1    10                     8
2 to 1 to 1    11                     11

Table 2

Database theory holds that 1-to-1 (or 1-to-1-to-1) joins of databases with the same number of records should never happen, since such databases are improperly normalized. However, in the real world, these types of joins do occur for several reasons. First, sometimes different people or organizations in different locations are collecting related data about the same subjects. This is especially true of large bureaucracies. Another reason why 1-to-1 (and 1-to-1-to-1) joins with several databases can occur is because some databases contain very sensitive information which requires special security. The sensitive information must be stored in a location apart from the less sensitive information. Such joins also occur when collecting baseline and time series data. The baseline data may later be merged into the data collected at any or each time point. Finally, of course, such joins also are required at times because databases were improperly designed.

The authors have encountered the 2-to-1-to-1 join while constructing analysis datasets containing research data about twins. To use this example to explain more fully the diagram, the top database could contain data provided by the father (collected during a confidential one-to-one interview), the middle database could contain data provided by the mother (collected during a confidential one-to-one interview), and the lower database could contain data about the twins culled from hospital records. In this example, the mother and father data are joined first, keeping only those variables necessary for the analysis. Then the parent information is joined with the twin information, and a separate record for each child is created.

Conclusion

Comparing Table 1 to Table 2 reveals that for 2-to-1 (and 2-to-1-to-1) joins, there is no advantage to using the xBase JOIN command to join two databases. It is just as efficient to first convert the xBase databases into SAS datasets and then MERGE them. Because the analysis will be done with SAS software, and because it takes time and effort to switch between the SAS package and the xBase product, it is logical to begin by executing the PROC DBF statements and continue with the SAS analysis. In the case of 1-to-1 joins involving two or more databases, however, there appears to be an advantage to using the xBase JOIN to create the analysis dataset prior to entering the SAS system. This of course assumes that the databases to be analyzed are sufficiently large to warrant the extra time and effort. Our model does not take into account the time required to start up and quit the xBase package. Further, the model is based on a number of assumptions which are not intended to represent every situation or take into account every determinant of PC file processing behavior. System hardware and software differences (disk caching, memory management, page mode versus interleaved memory access, the presence or absence of a RAM disk, memory caching, system configuration) play a significant role in file processing performance. In the case of 1-to-1 joins involving large input DBF files, however, one seriously should consider doing xBase JOINs prior to invoking PROC DBF.

For further information, contact:

Michael Wrona
Assessment Services Program
Far West Laboratory
730 Harrison Street
San Francisco, CA 94100

(R) SAS is a registered trademark of SAS Institute Inc., Cary, NC, USA.
(R) dBase is a trademark of Ashton-Tate Corporation, Torrance, CA, USA.
(TM) FoxPro is a trademark of Fox Software Inc., Perrysburg, OH, USA.