DESIGN CONSIDERATIONS FOR OPTIMIZING FILE PROCESSING IN AN INTEGRATED DATA MANAGEMENT AND ANALYSIS SYSTEM

Michael R. Wrona, Far West Laboratory for Educational Research and Development
Norman A. Constantine, Far West Laboratory for Educational Research and Development
Mark P. Branagh, Stanford University

Introduction

PROC DBF is a SAS (R) procedure for personal computers which converts an xBase file into a SAS dataset file. By convention, xBase files have the extension ".dbf" and are often referred to as DBF files. SAS dataset files on the personal computer have the extension ".ssd" and will be referred to herein as SSD files. PROC DBF takes as input a DBF file on disk and outputs an SSD file to disk. This process is represented symbolically in Figure 1.

[Figure 1: xBase file (dbf) -> PROC DBF -> SAS dataset (ssd)]

The xBase file structure is an industry standard for personal computer database systems. As such, it has a large installed base. It is employed by, among others, the makers of dBase (R) and FoxPro (TM). xBase generically refers to what personal computer database users have previously called dBase compatible. Statisticians and data analysts frequently will be called upon to perform statistical analyses on the contents of xBase DBF files. The SAS system is ideal for such analyses. In the simple case where the object is to perform an analysis using a single DBF file, PROC DBF is employed to convert the DBF file to an SSD file and then the analysis is executed with the SAS system for personal computers. This paper explores the more complex situation where the objective is to analyze the contents of two or more related DBF files, asking the following question: At what point should PROC DBF be invoked to optimize file processing? Specifically, under what circumstances should one use the xBase JOIN command rather than the SAS MERGE facility to combine two or more databases? The two ways in which two DBF files can be combined are shown in Figure 2.

[Figure 2: two ways to combine two DBF files: PROC DBF first, then MERGE; JOIN first, then PROC DBF]

Efficiency

The answer to the question posed above must be predicated upon the notion of efficiency. When working with databases, the most efficient file processing strategy is generally the one which minimizes disk accesses. This is because reading data from a disk into memory or writing data to a disk from memory typically requires substantially more time than operations which can be performed entirely within the computer's memory. The process of reading data from a disk (or other storage medium) into the computer's memory is referred to as an input operation. The process of writing data that is in the computer's memory to a disk (or other storage medium) is called an output operation. Input operations and output operations are similar processes which require, for the most part, comparable amounts of time. Thus, regardless of which direction the data flows, these operations are referred to as input/output operations, or I/O operations. The most efficient (optimized) system is the one which minimizes I/O operations. The goal of this paper is to identify the file processing technique with the lowest theoretical I/O count.

Counting I/O Operations

The number of I/Os required to read or write the contents of a database is of course dependent on the size of the database. To simplify the discussion which follows, we will assume that all DBF files whose contents will be used in an analysis are the same size. This allows us to consider that the act of reading or writing the contents of a database requires one I/O operation. The actual number of I/Os is merely a constant times the model formula. Because we are interested in the relative efficiency of competing file processing methods, this approach is appropriate.

How many I/Os are required to convert a DBF file into an SSD file using PROC DBF? PROC DBF takes as input one disk file and outputs to disk a file in a different format. Thus, we shall say PROC DBF requires two I/Os. (Because the input file and the output file have different file structures and will require different amounts of disk space, this is not technically precise, but it is adequate for our purposes.) The question of how many I/Os are required for the xBase JOIN command is a bit trickier.

The JOIN command takes as input two DBF files, merges them into one file, and outputs the resulting file to disk. Clearly each file input counts as one I/O operation. Since there are two input files, there are a total of two I/Os for the input operations associated with the JOIN command. The size of the file output, however, is contingent upon the way the files are joined.

The smallest output file would have zero records. The number of I/Os for this case theoretically would be zero, although in actuality the creation of the database on disk would require some I/Os. The largest output file would result when each record in one input file was joined with every record in the other input file. If each input file had M records, this type of join would result in an output file of M^2 records. These types of joins should rarely occur and are so unusual that they will not be discussed further.

The most common join we've encountered is a 1-to-1 matching of records between two databases. The theoretical maximum size of the resultant output file will be the sum of the two input files. In practice, however, the ever efficient analyst will retain only the necessary variables from each database. Our experience shows that the average database resulting from a 1-to-1 join is roughly equivalent in size to either of the input databases. Thus, the output from a 1-to-1 join would result in one I/O operation.
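The counting rules stated so far can be written down as a small sketch. This is only an illustration of the model, not part of the SAS or xBase software: the function names are hypothetical, one "unit" stands for the I/O needed to read or write one input-sized file, and the cross-join case assumes the output records are about as wide as the input records.

```python
# Illustrative I/O cost model. All function names are hypothetical
# and are not part of SAS or any xBase product.
# One "unit" = the I/O needed to read or write one input-sized file.

def proc_dbf_ios(file_units=1):
    """PROC DBF: read one DBF file, write one SSD file (assumed equal size)."""
    return file_units + file_units


def join_ios(output_units):
    """xBase JOIN: read two one-unit DBF files, then write the joined file."""
    return 1 + 1 + output_units


def cross_join_output_units(records_per_file):
    """Every record paired with every record gives M * M records.

    Assumption: output records are as wide as input records, so
    M * M records occupy M files' worth of space.
    """
    return (records_per_file * records_per_file) // records_per_file


print(proc_dbf_ios())                          # 2
print(join_ios(0))                             # 2: empty output, inputs only
print(join_ios(1))                             # 3: typical 1-to-1 join
print(join_ios(cross_join_output_units(100)))  # 102: largest case, M = 100
```

Counting in file-sized units keeps the model independent of the actual record count, which matches the assumption that the true number of I/Os is a constant times the model formula.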
By applying this logic, the expected size of the resultant database in an M-to-1 join of two databases will be M times the size of either input database. In the case of a 2-to-1 join between two databases, each file input would be counted as one I/O operation and the file output would be counted as two I/O operations. The 2-to-1 join is contrasted with the 1-to-1 join in Figure 3.

[Figure 3: Join type 2 to 1 (4 I/Os): two inputs, two outputs. Join type 1 to 1 (3 I/Os): two inputs, one output.]

Working with 2 Input DBF Files

Figures 4 and 5 compare the number of I/Os required for doing 1-to-1 joins with two input DBF files. Figures 6 and 7 show the number of I/Os required for doing 2-to-1 joins with two input DBF files. In each case, we look first at the efficiency of doing the xBase JOIN prior to running PROC DBF, and then at the efficiency of running PROC DBF twice followed by a SAS MERGE.

In Figure 4, three I/Os are required for the xBase JOIN operation (two inputs and one output) and two I/Os are required for the PROC DBF (one input and one output). In Figure 5, each execution of PROC DBF requires two I/Os (one input and one output) and the SAS MERGE requires three I/Os (two inputs and one output). Thus, for a 1-to-1 join of two DBF files, it is more efficient to use the xBase JOIN than the SAS MERGE.

[Figure 4: Join type 1 to 1. xBase JOIN first, then PROC DBF. 5 Input/Output Operations Total]

[Figure 5: Join type 1 to 1. PROC DBF first, then MERGE. 7 Input/Output Operations Total]

In Figure 6, four I/Os are required for the xBase JOIN (one for each file input and two for the output file with 2N records created by the 2-to-1 join). Four I/Os are required for the PROC DBF (one input of 2N records and one output of 2N records). In Figure 7, two I/Os are required for each execution of PROC DBF (one input and one output) and four I/Os are required for the SAS MERGE (two inputs each comprised of N records and one output of 2N records). Thus, for a 2-to-1 join with two input files, it does not matter whether the files are JOINed or MERGEd. Table 1 summarizes the information in Figures 4 through 7.

[Figure 6: Join type 2 to 1. xBase JOIN first, then PROC DBF. 8 Input/Output Operations Total]

[Figure 7: Join type 2 to 1. PROC DBF first, then MERGE. 8 Input/Output Operations Total]

Table 1. Summary: two DBF files (total I/O operations)

JOIN TYPE    PROC DBF done first    xBase JOIN done first
1 to 1                7                       5
2 to 1                8                       8

Working with 3 Input DBF Files

Figure 8 shows the two models for converting three input DBF files into a single output SSD file. Recall that the xBase JOIN is a binary operation. It can only join two databases at a time. The SAS MERGE facility, however, can accommodate several datasets.

[Figure 8: three input DBF files to one SSD file. PROC DBF first, then MERGE; xBase JOIN first, then PROC DBF]

Figures 9 and 10 compare the number of I/Os required for doing 1-to-1-to-1 joins with three input DBF files.
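Before turning to the three-file figures, the two-file totals in Table 1 can be checked with a short sketch. It is an illustration under the paper's unit-size assumption, not an implementation of either product: the names step_ios, join_first, and proc_dbf_first are hypothetical, each input DBF file counts as one unit, and an M-to-1 join or merge is taken to produce M units of output.

```python
# Hypothetical helper names; each input DBF file is one I/O unit.

def step_ios(input_units, output_units):
    """I/Os for one processing step: read every input, write the output."""
    return sum(input_units) + output_units


def join_first(m):
    """xBase JOIN of two DBF files (output m units), then one PROC DBF."""
    join = step_ios([1, 1], m)     # two inputs, m-unit joined DBF file
    convert = step_ios([m], m)     # PROC DBF reads m units, writes m units
    return join + convert


def proc_dbf_first(m):
    """PROC DBF on each DBF file, then SAS MERGE (output m units)."""
    convert = 2 * step_ios([1], 1)  # two conversions, two I/Os each
    merge = step_ios([1, 1], m)     # two SSD inputs, m-unit merged output
    return convert + merge


for m in (1, 2):
    print(f"{m} to 1: PROC DBF first = {proc_dbf_first(m)}, "
          f"JOIN first = {join_first(m)}")
# 1 to 1: PROC DBF first = 7, JOIN first = 5
# 2 to 1: PROC DBF first = 8, JOIN first = 8
```

Treating every step as "sum of inputs plus output" keeps the arithmetic in one place, so the same helper can describe JOIN, MERGE, and PROC DBF.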