NESUG 15 Beginning Tutorials

Proc SQL vs. Pass-Thru SQL on DB2® Or How I Found a Day

Jeffrey S. Cohen, American Education Services/PHEAA

Abstract

Many people use Proc SQL to query DB2 databases yet ignore the Pass-Thru ability of this Proc. This paper compares SAS® Proc SQL and Pass-Thru SQL on the same 30 million-row table to highlight the time and resource savings that Pass-Thru SQL can give.

Overview

One of the most useful features of SAS software is its ability to read data in almost any form or storage medium from multiple data sets. When accessing data from a Database Management System (DBMS) such as DB2, the SQL procedure is required. As with many SAS procedures, a variety of options are available, some of which greatly affect performance. We will examine three techniques for joining three tables with Proc SQL and their effect on performance.

SQL Basics

To extract data from DB2 tables, SQL is the preferred method. SQL uses a minimum of two clauses: the Select clause and the From clause. The Select clause lists the columns or variables that you want in the final data set or result set. The From clause tells DB2 what tables the data is coming from. A third clause that is useful is the Where clause, similar to the If statement in a data step. In the Where clause, you list the conditions, separated by and or or. To illustrate this concept consider the following example:

    Select Name,
           Age,
           City
    From Table_Demographics
    Where Age < 30 and
          (City = 'Syracuse' or
           State = 'Maine')

This example will produce a data set with three variables consisting of people who are under 30 and live either in Syracuse or Maine.

When extracting data from multiple tables, you must employ Join conditions to avoid a Cartesian product, where each row from table one is joined with each row in table two. A Join condition tells the DBMS how to match the rows of the two tables. Consider the above example, only now we want the color of the person's car as well:

    Select A.Name,
           A.Age,
           A.City,
           B.Color
    From Table_Demographics A Inner Join
         Table_Cars B on
         A.Name = B.Name
    Where A.Age < 30 and
          (A.City = 'Syracuse' or
           A.State = 'Maine')

Notice that when selecting multiple columns a comma must separate them. Also notice that the table Table_Demographics is given the alias A and the table Table_Cars is given the alias B. These tell the DBMS which table the columns come from.

Data Description

The data for this exercise comes from three DB2 tables in a data warehouse. The first table (DCA) holds all transactional data and totals approximately 29 million rows. The second table (DCB) houses application information and totals approximately 3 million rows. The third table (DCC) houses claim information and contains just under 1 million rows. The keys used to join the DCA and DCB tables are the borrower ID (AF-APL-ID), the borrower ID suffix (AF-APL-ID-SFX) and the timestamp (LF-CRT-DTS-DCB) for the DCB table. The keys common to the DCB and the DCC are the user who created the record (BF-USR-CRT-DCC) and the timestamp (BF-CRT-DTS-DCC) for the DCC table. The assignment is to select 18 columns from the DCB, 2 from the DCC and 15 from the DCA.

Proc SQL

Using SAS's Proc SQL, the query is structured to create a table by selecting the columns from the Join of the three tables. "The SQL Query Optimizer must choose between the following to join the tables:


• Sorting the tables and performing a match merge.
• Accessing the rows of one table sequentially, and fetching the matching rows from the other table via an index on that table.
• Loading the rows of the smaller table into memory and processing the rows of the other table sequentially." [1]

Since SAS cannot read DB2 indexes without assistance and the tables are too big to fit into memory, the only option is to sort the tables and perform a match merge. The query and results look like this:

    1                          The SAS System
    NOTE: Copyright (c) 1999-2000 by SAS Institute Inc., Cary, NC, USA.

    6    libname mydb db2 database=mydbase user=XXXXXXXX using=XXXXXXXX
    7    owner=mydb;
    NOTE: Libref MYDB was successfully assigned as follows:
          Engine:        DB2
          Physical Name: mydbase

    17   proc sql;
    18   CREATE table EXPORT AS
    19   select a.af_apl_id
    20   ,a.af_apl_id_sfx
    21   ,a.lf_crt_dts_dcb
    22   ,a.bf_ssn
    51   ,c.LC_TRX_TYP
    52   ,c.LD_TRX_EFF
    53   ,c.lc_rco
    54   from mydb.dcb_claim a
    55   left outer join
    56   mydb.dcc_clpkg b
    57   on
    58   a.bf_crt_dts_dcc = b.bf_crt_dts_dcc
    59   and a.bf_usr_crt_dcc = b.bf_usr_crt_dcc
    60   and a.lc_sta_dcb in ('03','04')
    61   inner join
    62   mydb.dca_activity c
    63   on
    64   a.af_apl_id = c.AF_APL_ID
    65   AND a.af_apl_id_sfx = c.AF_APL_ID_SFX
    66   AND a.lf_crt_dts_dcb = c.LF_CRT_DTS_DCB
    76   or c.la_apl_int ^= 0)));
    NOTE: Table WORK.EXPORT created, with 27035622 rows and 35 columns.

    77   QUIT;
    NOTE: PROCEDURE SQL used:
          real time           16:14:16.15
          cpu time            15:57:57.26

This query ran on an IBM AIX SP2 system with SAS V8.1. As the log shows, the query performed poorly, taking almost 16 hours of CPU time to complete.

Proc SQL with DBKEY

One method of speeding up the query is to force Proc SQL to use the DB2 index. This is accomplished by using the DBKey option in the From clause. In the DBKey option, the key columns from the table on the right are listed. The From clause of the previous query looked like this after the change:

    from mydb.dcb_claim a
    left outer join
        mydb.dcc_clpkg (dbkey=(bf_crt_dts_dcc bf_usr_crt_dcc)) b
      on a.bf_crt_dts_dcc = b.bf_crt_dts_dcc
     and a.bf_usr_crt_dcc = b.bf_usr_crt_dcc
     and a.lc_sta_dcb in ('03','04')
    inner join
        mydb.dca_activity (dbkey=(af_apl_id af_apl_id_sfx lf_crt_dts_dcb)) c
      on a.af_apl_id = c.af_apl_id
     and a.af_apl_id_sfx = c.af_apl_id_sfx
     and a.lf_crt_dts_dcb = c.lf_crt_dts_dcb

The new query returned these results:

    6    libname mydb db2 database=mydbase user=xxxxxxxx using=XXXXXXXX
    7    owner=mydb;
    NOTE: Libref MYDB was successfully assigned as follows:
          Engine:        DB2
          Physical Name: mydbase

    17   proc sql;
    18   CREATE table EXPORT AS
    19   select a.af_apl_id
    20   ,a.af_apl_id_sfx
    21   ,a.lf_crt_dts_dcb
    22   ,a.bf_ssn
    23   ,a.lc_sta_dcb
    24   ,a.lc_aux_sta
    25   ,a.lc_pcl_rea
    52   ,c.LD_TRX_EFF
    53   ,c.lc_rco
    54   from mydb.dcb_claim a
    55   left outer join
    56   mydb.dcc_clpkg(dbkey=(bf_crt_dts_dcc bf_usr_crt_dcc)) b
    57   on
    58   a.bf_crt_dts_dcc = b.bf_crt_dts_dcc
    59   and a.bf_usr_crt_dcc = b.bf_usr_crt_dcc
    60   and a.lc_sta_dcb in ('03','04')
    61   inner join
    62   mydb.dca_activity(dbkey=(af_apl_id af_apl_id_sfx lf_crt_dts_dcb)) c
    63   on
    64   a.af_apl_id = c.AF_APL_ID
    65   AND a.af_apl_id_sfx = c.AF_APL_ID_SFX
    66   AND a.lf_crt_dts_dcb = c.LF_CRT_DTS_DCB
    76   or c.la_apl_int ^= 0)));
    NOTE: Table WORK.EXPORT created, with 27035622 rows and 35 columns.

    77   QUIT;
    NOTE: PROCEDURE SQL used:
          real time           9:59:12.67
          cpu time            1:05:18.00

As you can see, the inclusion of the DBKey option reduced the CPU time dramatically, from almost 16 hours to a little over one hour. The real time, while still high at 10 hours, dropped by about 6 hours. Further time and performance savings are still possible.

Pass-Thru SQL

As you can see in the first two examples, it was necessary to code libname statements to point SAS to the proper database. Using SAS/ACCESS®, it is possible to pass the entire SQL statement to the DBMS for processing. This method allows the DBMS to determine the best optimization methods to execute. Using SQL Pass-Thru coding allows DB2 to use its indexes and key structures to join the tables in the most efficient fashion.

To code a Pass-Thru statement, the traditional SQL statements appear inside a shell similar to this:

    Proc SQL;
    Connect to DB2( Database=mydb);
    Create table Test as
    Select *
    From Connection to db2(
        Select …….
        From ………
        Where …….);
    %PUT &SQLXRC &SQLXMSG;
    Disconnect from DB2;
    Quit;

You can break the code into several statements. The first is the Proc SQL statement, which invokes the SAS SQL interface. Options associated with Proc SQL are still valid; however, you must keep in mind where your code is running. The second statement is the connection. This statement tells SAS/ACCESS where the database is and passes any login IDs and passwords to the database. The next two statements are standard Proc SQL Create Table and Select statements. The asterisk in the Select statement tells SAS to put all the columns sent from the DBMS into the result table. The From clause tells SAS where to get the data, in this case from the connection to DB2. Within the parentheses are the SQL statements that the DBMS will execute, in the standard Select, From, Where format. The macro variables in the %PUT statement write any error codes and messages sent by the DBMS to your SAS log. Without this statement, you will not know why your query did not execute. The final two statements disconnect you from the database and exit the interface.

Submitting the query in this format produces the following results:

    1                          The SAS System
    NOTE: Copyright (c) 1999-2001 by SAS Institute Inc., Cary, NC, USA.
    NOTE: SAS (r) Release 8.1 (TS1M0)
          Licensed to PA HIGHER EDUCATION ASSISTANCE AGENCY
    NOTE: This session is executing on the AIX 4.3 platform.

    NOTE: SAS initialization used:
          real time           1.32 seconds
          cpu time            0.02 seconds

    12   proc sql;
    13   connect to db2(database=mydbase user=XXXXXXX using=XXXXXXXX);
    14   CREATE table EXPORT AS
    15   select *
    16   from connection to db2(
    17   select a.bf_ssn
    18   ,a.lc_pcl_rea
    21   ,a.la_clm_pri
    44   ,c.LD_TRX_EFF
    45   ,c.lc_rco
    46   from mydb.dcb_claim a
    47   left outer join
    48   mydb.dcc_clpkg b
    49   on
    50   a.bf_crt_dts_dcc = b.bf_crt_dts_dcc
    67   and (c.la_apl_pri ^= 0
    68   or c.la_apl_int ^= 0))));
    NOTE: Table WORK.EXPORT created, with 27383541 rows and 35 columns.

    69   disconnect from db2;
    70   QUIT;
    NOTE: PROCEDURE SQL used:
          real time           49:13.69
          cpu time            25:53.55

The use of Pass-Thru SQL saved a further 39 minutes of CPU time (1:05:18 down to 25:53) and a whopping nine hours of real time (9:59:12 down to 49:13).
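The savings between the three runs can be checked directly from the times printed in the logs. The following is a minimal sketch in Python (not part of the original SAS work); the helper names `to_seconds`, `to_hms` and `savings` are hypothetical, and the only data used are the real and CPU times copied from the three logs above.

```python
def to_seconds(stamp: str) -> float:
    """Convert a SAS log time such as '16:14:16.15' or '49:13.69' to seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def to_hms(seconds: float) -> str:
    """Format a number of seconds as h:mm:ss, rounded to the nearest second."""
    total = round(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# (real time, cpu time) exactly as reported in the three SAS logs.
runs = {
    "Proc SQL": ("16:14:16.15", "15:57:57.26"),
    "Proc SQL with DBKEY": ("9:59:12.67", "1:05:18.00"),
    "Pass-Thru SQL": ("49:13.69", "25:53.55"),
}


def savings(before: str, after: str) -> tuple[str, str]:
    """Return (real time saved, cpu time saved) going from one run to another."""
    real = to_seconds(runs[before][0]) - to_seconds(runs[after][0])
    cpu = to_seconds(runs[before][1]) - to_seconds(runs[after][1])
    return to_hms(real), to_hms(cpu)


print(savings("Proc SQL", "Proc SQL with DBKEY"))       # ('6:15:03', '14:52:39')
print(savings("Proc SQL with DBKEY", "Pass-Thru SQL"))  # ('9:09:59', '0:39:24')
```

Each run was timed once, so these differences describe this one job on this one system rather than a general speed-up factor.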

Conclusion

As with many SAS procedures, there is usually more than one way to accomplish the same result. Querying DB2 systems requires the use of SQL, and in this comparison Pass-Thru SQL made the most efficient use of resources and time.
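Readers without access to DB2 can still try the join-condition example from SQL Basics at small scale. The sketch below is an illustration only: it uses Python's built-in sqlite3 module as a stand-in DBMS (an assumption, since the paper itself uses DB2 through SAS), and the rows in Table_Demographics and Table_Cars are made up. It runs the same Where clause once without a join condition, which yields a Cartesian product, and once with the Inner Join on Name.

```python
import sqlite3

# In-memory stand-in database; the table names follow the SQL Basics example,
# the rows are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Table_Demographics (Name TEXT, Age INTEGER, City TEXT, State TEXT);
    CREATE TABLE Table_Cars (Name TEXT, Color TEXT);
    INSERT INTO Table_Demographics VALUES
        ('Ann', 25, 'Syracuse', 'New York'),
        ('Bob', 28, 'Portland', 'Maine'),
        ('Cal', 45, 'Syracuse', 'New York');
    INSERT INTO Table_Cars VALUES
        ('Ann', 'Red'), ('Bob', 'Blue'), ('Cal', 'Green');
""")

where = "A.Age < 30 AND (A.City = 'Syracuse' OR A.State = 'Maine')"

# No join condition: every qualifying person is paired with every car.
cartesian = con.execute(
    "SELECT A.Name, B.Color "
    "FROM Table_Demographics A, Table_Cars B "
    f"WHERE {where}"
).fetchall()

# With the join condition, each person is matched only to his or her own car.
joined = con.execute(
    "SELECT A.Name, A.Age, A.City, B.Color "
    "FROM Table_Demographics A INNER JOIN Table_Cars B ON A.Name = B.Name "
    f"WHERE {where} ORDER BY A.Name"
).fetchall()

print(len(cartesian))  # 6 rows: 2 qualifying people x 3 cars
print(joined)          # [('Ann', 25, 'Syracuse', 'Red'), ('Bob', 28, 'Portland', 'Blue')]
```

With three-row tables the Cartesian product is harmless; with tables of millions of rows, like those in this paper, it is the difference between a usable query and one that never finishes.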

References

[1] Kent, Paul (2000), "SQL Joins -- The Long and The Short of It", SAS Technical Note TS-553, Cary, NC: SAS Institute Inc.

Acknowledgment

Thank you to Henry Hill of AES/PHEAA for his valuable assistance in developing this project.

Trademark Information

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are registered trademarks or trademarks of their respective companies.
