WORKING WITH SUBQUERY IN THE SQL PROCEDURE Lei Zhang, Domain Solutions Corp., Cambridge, MA Danbo Yi, Abt Associates, Cambridge, MA

ABSTRACT blood pressure at each visit recorded by doctors in ® variables PATID, VISID, VISDATE, MDIABP and Proc SQL is a major contribution to the SAS /BASE MSYSBP. The source SAS codes to create the three system. One of powerful features in SQL procedure data sets are listed in the appendix. is subquery, which provides great flexibility in manipulating and querying data in multiple tables simultaneously. However, subquery is the subtlest SUBQUERY IN THE WHERE CLAUSE part of the SQL procedure. Users have to understand the correct way to use subqueries in a We begin with WHERE clause because it is the variety of situations. This paper will explore how most common place to use subqueries. Consider subqueries work properly with different components following two queries: of SQL such as WHERE, HAVING, FROM, Example 1: Find patients who have records in table SELECT, SET clause. It also demonstrates when DEMO but not in table VITAL. and how to convert subqueries into JOINs solutions to improve the query performance. Proc SQL; Select patid demog INTRODUCTION Where patid not in ( patid from vital);

One of powerful features in the SAS SQL procedure Result: is its subquery, which allow a SELECT statement to be nested inside the clauses of another SELECT, or PATID inside an INSERT, UPDATE or DELETE statement, ----- even or inside another subquery. Depending on the clause that contains it, a subquery can select a 150 single value or multiple values from related tables. If more than one subquery is used in a query- expression, the innermost query is evaluated first, then the next innermost query, and so on. PROC Example 2: Find patients who didn’t have their SQL allows a subquery (contained in parentheses) blood pressures measured twice at a visit. to be used at any point in a query expression, but user must understand when and to use Proc SQL; uncorrelated/correlated subqueries, non-scalar and scalar subqueries, or the combinations. This paper select patid from vital T1 will discuss the subquery usage with following where 2>(select count(*) sections: from vital T2 · Subquery in Where clause where T1.patid=T2.patid · Subquery in Having clause and · Subquery in From clause T1.visid=T2.visid); · Subquery in Select clause · Subquery in SET clause Result: · Subquery and JOINs

The examples in this paper refer to an imaginary set PATID of clinical trial data, but the techniques used will ----- apply to other industries as well. There are three 110 SAS data sets in the example database: Data set 140 DEMO contains variables PATID (Patient ID), AGE, and SEX. Data set VITAL contains variables PATID, In the example 1, The SELECT expression in the VISID (Visit ID), VISDATE (Visit date), DIABP and parentheses is an un-correlated subquery. It is an SYSBP (patient’s diastolic/ systolic blood pressure inner query that is evaluated before the outer (main) which is measured twice at a visit). Data set query and its results are not dependent on the VITMEAN contains patient’s mean diastolic/ systolic values in the main query. It is not a scalar subquery This solution is using a double negative structure. because it will return multiple values. The original question can be put in another way: Find each patient for whom no visit exists in which In the example2, the SELECT expression in the the patient concerned has never had. The two parenthesis is a correlated subquery because the nested subqueryies produce a list of visits for whom subquery refers to values from variables T1.PATID a given patient didn’t have. The main query and T1.VISID in a table T1 of the outer query. The presents those patients for whom the result table of correlated subquery is evaluated for each row in the the subquery is empty. SQL procedure determines outer query. With correlated subqueries, PROC SQL for each patient separately whether the subquery executes the subquery and the outer query together. yields no result. The subquery is also a scalar subquery because aggregate function COUNT(*) always returns one In the WHERE clause, IN, ANY, ALL, EXISTS value with one column for each row values from the predicates are allowed to use any type of subqueries outer query at a time. (see example 1 and 3 above); However, only scalar subqueries can be used in relation operators (>, =, Multiple subqueries can be used in the SQL <, etc), BETWEEN, and LIKE predicates. One way WHERE statement. to turn a subquery to into a scalar subquery is using aggregate functions. Consider Example 4 below, Example 3: Find patients who have the maximum which two scalar subqueries are used in the number of visits. BETWEEN predicate.

Proc SQL; Example 4: Find patients whose age is in the average age +/- 5 years. Select distinct patid Proc SQL; from vitmean as a Select patid Where not exists from demog (select visid from vitmean where age between Except (select Avg(age) Select visid from vitmean as b from demog)-5 Where a.patid=b.patid); and (select avg(age) Result: from demog)+5; Result: PATID ----- PATID 120 ------110

Below is a nested correlated subquery solution to example 3. SUBQUERY IN THE HAVING CLAUSE

Proc SQL; Users can use one or more subqueries in the Having clause as long as they do not produce multiple Select distinct patid from vitmean as a values. It means that the most outer subquery in the Having clause must be scalar subquery, either where not exists uncorrelated or correlated. The following query is (select distinct visid an example, from vitmean as b where not exists Example 4: Find patients who didn’t have maximun (select visid number of visits. from vitmean as c Proc SQL; where a.patid=c.patid Select patid and From vitmean as a b.visid=c.visid)); Group by patid Having count(visid) from vital <(select where visid>'1') as a, max(cnt) (select distinct patid, from visid, visdate (select from vital count(visid) where visid<'4') as b as cnt from vitmean where(input(a.visid,2.)-1 patid) =input(b.visid,2.)) ); and a.patid=b.patid a.patid, a.visid; Result: Result: PATID ------PATID VISID INTERVAL 100 ------110 100 2 7 130 100 3 7 140 120 2 8 120 3 6 The scalar subquery in this command counts the 120 4 34 maximum number of visits first, and then main 130 2 35 query find the patient whose number of visits is not equal to the maximun value calculated by the 130 3 38 subquery. For a correlated scalar subquery in the 140 2 9 Having clause, see example 6 in next section. In the example 5, two subqueries are joined together SUBQUERY IN THE FROM CLAUSE to form an in-line view in the From clause. The first one has all required patient visit data with visit ID Subqueries can be used in the SQL FROM clause to >1, and the second one has all required patient visit constitute in-line views in the SQL procedure data with visit ID <4. They are joined by PATID and because subqueries can be regarded as temporary VISID to calculate the interval between this visit and tables. You will notice the usage in the example 4, in previous visit. The result is shown above. which subquery in the From clause is actually an in- line view. In-line views are views in the FROM Example 6 below is more complicated case, in clause. They are extended temporary tables which which two subqueries are joined together by visid are constructed by UNION, INTERSECT, EXCEPT, and patid to from an in-line view. The results from and JOINs based on tables and subqueries. In-line the in-line view are then used by main (outer) query view can not correlate with its main query, but outer to obtain the list of patients who have exact same (main) query can refer any values in the tables and visits. subqueries which constitute the in-line view. The Example 6: Find patients who have exact same main advantages of using in-line view is improving visits. the SQL performance and reducing the SQL statements. Consider Example 5. proc ; Example 5: Calculate the interval between this visit create table vital as and previous visit for each patient visit. select distinct patid, visid proc sql; from vital; select a.patid, a.visid, select T1.PATID as PATID, intck('day', b.visdate, a.visdate) T2.PATID as PATID1 as interval FROM from (select patid, visid from VITAL) (select distinct patid, AS T1 visid, visdate inner (select patid, visid from VITAL) Result: AS T2 on T1.visid=T2.visid PATID VISID DIFF and T1.patid < T2.patid 100 1 0 Group BY T1.PATID, T2.PATID 100 2 0 Having count(*)=(Select count(*) 100 3 0 From vital as T3 110 1 0 Where 110 3 0 T3.PATID=T1.PATID) 120 1 0 and 120 2 34 count(*)= (Select Count(*) 120 3 7.5 From Vital as T4 120 4 0 130 1 -0.5 Where 130 2 -0.5 T4.PATID=T2.PATID); 130 3 -0.5 RESULT: 140 1 0 140 2 0 PATID PATID1 ------100 130 The subquery in the SQL clause is a correlated scalar subquery. It will calculate average systolic blood pressure based on VITAL table for each SUBQUERY IN THE SELECT CLAUSE patient visit from table VITMEAN. The main query then use the result to calculate the difference between recorded systolic blood pressure and SAS SQL can use a scalar subquery or even a calculated systolic blood pressure for each patient correlated scalar subquery in the SELECT clause. If visit. The process will be repeated until all patients the subquery results are an empty, the result is a in table VITMEAN are checked. missing value. This feature lets you write a LEFT JOIN as a query within SELECT clause. Here is an example. SUBQUERY IN THE SET CLAUSE Example 7: Compute the difference between recorded mean of SYSBP from VITMEAN table and You can use scalar subquery and/or correlated calculated mean of SYSBP from VITAL for each scalar subquery in the SET clause of the Update patient and visit. statement to modify a table based on itself or another table. Here is an example Proc SQL; Select a.patid, a.visid, Example 8: Modify the mdiabp and msysbp in the VITMEAN using values calculated from table Vital; a.msysbp - (select avg(sysbp) proc sql; from vital as b UPDATE VITMEAN as a where a.patid=b.patid and Set mdiabp=(select avg(b.diabp) a.visid=b.visid) from VITAL as b as diff where a.patid=b.patid from vitmean as a; and a.visid=b.visid), msysbp=(select avg(c.sysbp) from VITAL as c wherea.patid=c.patid and a.visid=c.visid); SUBQUERY AND JOINS CONCLUSIONS We summarize the usages of subquery in the Some users find subquery difficult to read, and different SQL components as follows: consequently fear that subqueries are difficult to use. Indeed, many subqueries can be replaced by SQL Usage one or more join queries. However, if you can’t Components construct an equivalent join query, the only Relation Scalar uncorrealted/correlated alternative will be to use a series of queries and Operator subquery pass the results from one query to another. This is (<, =, >, etc) obviously less efficient than using subqueries. Between, Scalar uncorrealted/correlated Subqueries can perform wonders of selection and Like subquery comparison so that one set of statements and nested subqueries can replace several non- IN, ANY, ALL, Uncorrelated/Correlated subquery subquery statements. This imporove efficiency and SOME, Exists could even effect performance. FROM Clause Uncorrelated subquery Having Scalar uncorrealted/correlated Generally whether using a join or a subquery is Clause subquery simply a matter of individual preference: they can Select Scalar uncorrealted/correlated often be used interchangeably to solve a given Clause subquery problem. Users will find that subqueries exceed the SET clause Scalar uncorrealted/correlated capabilities of join for certain queries. In some subquery cases, a subquery is the shortest route to the results a user need. However, if users can find an SAS and SAS/BASE mentioned in this paper are equivalent join query for a subquery, it can replace registered trademarks or trademarks of SAS the subquery to improve the performance, For Institute Inc. in the USA and other countries. example, the JOIN solution to the example 1 is: ® indicates USA registration. proc SQL; REFERENCES select a.patid from demog as a SAS Institute, Inc. (1992), SAS Guide to the SQL left join Procedure: Usage and Reference, Version 6, First (select distinct patid Edition from vital) as b SAS Institute, Inc. (1994), SAS Technical Report P- 222, Changes and Enhancements to Base SAS on a.patid=b.patid Software, Release 6.07. Getting started with the where b.patid is NULL; SQL, Procedure, First Edition and the JOIN solution to Example 7 is. AUTHOR’S ADDRESS

Proc SQL; Lei Zhang select a.patid, b.visid, DomianPharma Inc. max(msysbp)-avg(b.sysbp) 10 Maguire Road Lexington, MA 02140, USA as diff Tel: 781-778-3880 from vitmean as a Fax: 781-778-3700 left join e-mail: [email protected] vital as b on a.patid=b.patid Danbo Yi Abt Associates Inc. and a.visid=b.visid 55 Wheeler Street group by a.patid, b.visid; Cambridge, MA 02138-1168, USA Tel: 617-349-2346 In some situation, it may be worthwhile to look into Fax: 617-520-2940 rewriting such subqueries as JOINs, but the e-mail: [email protected] development and maintenance of such codes may have them less desirable than subqueries because JOINs solution may improve the performance at the cost of losing readability. APPENDIX: EXAMPLE SAS DATA SETS: run;

Data demog; data vitmean; input patid $1-3 age 5-6 sex $ 8; input patid $ datalines; visid $ 100 34 M visdate mmddyy8. 110 45 F mdiabp 120 67 M msysbp; 130 55 M format visdate mmddyy10.; 140 40 F cards; 150 38 M 100 1 01/12/97 67 110 ; 100 2 01/19/97 66 110 run; 100 3 01/26/97 68 120 110 1 02/03/97 68 110 data vital; 110 3 02/17/97 66 115 input patid $ 120 1 04/05/97 65 105 visid $ 120 2 04/12/97 64 107 visdate mmddyy10. 120 3 04/19/97 65 105 diabp 120 4 05/23/97 91 107 sysbp; 130 1 02/12/97 66 110 format visdate mmddyy10.; 130 2 03/19/97 66 109 datalines; 130 3 04/26/97 68 119 100 1 01/12/97 66 110 140 1 02/03/97 68 110 100 1 01/12/97 68 110 140 2 02/12/97 66 115 100 2 01/19/97 66 . ; 100 2 01/19/97 66 110 run; 100 3 01/26/97 68 120 100 3 01/26/97 68 120 110 1 02/03/97 68 110 110 3 02/12/97 64 115 110 3 02/12/97 68 115 120 1 04/05/97 66 105 120 1 04/05/97 64 105 120 2 04/13/97 110 64 120 2 04/13/97 105 82 120 3 04/19/97 63 105 120 3 04/19/97 105 90 120 4 05/23/97 90 106 120 4 05/23/97 92 108 130 1 02/12/97 66 111 130 1 02/12/97 67 110 130 2 03/19/97 66 109 130 2 03/19/97 66 110 130 3 04/26/97 68 121 130 3 04/26/97 68 118 140 1 02/03/97 68 110 140 2 02/12/97 64 115 140 2 02/12/97 68 115 ;