Effective search in a normalized application
By Gints Plivna

Search for your data without expensive joins (each time)

Our latest application was a normalized OLTP system.1 It collected data about the lifecycle of people – birth, name, citizenship, address, marriage and so on. Looking at these objects it is obvious that such an application mostly has to deal with changes: upon birth an instance of a person is created, and then throughout his life he, faster or slower, keeps changing the properties mentioned above. A normalized data model is therefore perfect for recording changes, not so good for retrieving all information about one given person, and bad for searching for a person whose surname contains three characters "a", who was born somewhere between 1960 and 1970, who lives in a big city and whose father's name is "John". As you can imagine, none of these conditions is restrictive enough on its own, although combined they produce a rather small result set. Unfortunately our application had to support such almost ad hoc queries.

Sizing up the situation

Initially almost every search was performed in the same way – the necessary tables were joined together, the optimizer picked the plan and voila! Unfortunately the same query plan is used for both of the following queries:
- give me persons whose surname is like "Plivna%" and whose citizenship is "Latvia";
- give me persons whose surname starts with "A%" and whose citizenship is "Vatican".
Keeping in mind that the application was developed in Latvia and fewer than 100 people have the surname "Plivna" on the one hand, while Vatican has only about 1 000 inhabitants in total and you can imagine yourself how many surnames start with "A" on the other hand, it is obvious that in at least one case the query plan is very inefficient. The application was transactional and had optimizer_mode = FIRST_ROWS_1, so these searches suffered from never-ending nested loops (and full scans wouldn't have been any better). The first step towards a solution was to distinguish two types of queries:
- queries with accurate conditions (like surname "Plivna", document number 123456789, birth date 1960.01.01);
- queries with many inaccurate conditions that hopefully still produce a small result set.
Accurate queries could be answered easily without any big extra effort: index access producing a small number of candidate rows before any join worked very well on the existing data model (a minimal example follows). The hardest part was to answer effectively a query like the one given in the introduction of this paper.
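To make the distinction concrete, here is a minimal sketch of an "accurate" query against the example schema created later in this paper. The index on prs_surname is an assumption for illustration only – the scripts below do not create it:

CREATE INDEX prs_surname_idx ON persons (prs_surname);  -- hypothetical supporting index

-- an "accurate" search: one selective predicate, answered by a single
-- index range scan on the driving table, no joins needed for filtering
SELECT prs_id, prs_name, prs_surname
  FROM persons
 WHERE prs_surname LIKE 'Plivna%';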

1 You can look at the whitepaper about normalized databases written by Jared Still. It can be downloaded from his website http://jaredstill.com/articles.html.

Bright idea – materialized views

The situation seems like a perfect case for materialized views, doesn't it? That was just my initial thought! But the road to the final solution was rather long. According to Oracle documentation, materialized views can be refreshed either on commit or with a complete refresh.2 Of course the first thought was – on commit, yes, that's it. Unfortunately, in our real case refresh on commit was too slow: in a small test case the response time was ~10 times slower than without materialized views. Then I traced the update. Comparing the trace files with and without the materialized views, I simply understood that it was not for us. It's about time to show the first example, so you can feel the difference. The tables in this example will be used throughout this document.

2 Full info on Oracle's website: http://download-east.oracle.com/docs/cd/B19306_01/server.102/b14223/basicmv.htm#sthref521.

DROP MATERIALIZED VIEW pers_full_info;
DROP TABLE person_citizenships;
DROP TABLE person_births;
DROP TABLE person_addresses;
DROP TABLE persons;

3 All scripts were tested on both 9.2 and 10.2 Oracle databases.

CREATE TABLE persons (
  prs_id      NUMBER NOT NULL,
  prs_name    VARCHAR2(20) NOT NULL,
  prs_surname VARCHAR2(20) NOT NULL);

CREATE TABLE person_addresses (
  adr_id         NUMBER NOT NULL,
  adr_prs_id     NUMBER NOT NULL,
  adr_country    VARCHAR2(20) NOT NULL,
  adr_city       VARCHAR2(20) NOT NULL,
  adr_street     VARCHAR2(20),
  adr_house_numb NUMBER,
  adr_start_date DATE NOT NULL,
  adr_end_date   DATE);

CREATE TABLE person_births (
  brt_id         NUMBER NOT NULL,
  brt_prs_id     NUMBER NOT NULL,
  brt_date       DATE NOT NULL,
  brt_start_date DATE NOT NULL,
  brt_end_date   DATE);

CREATE TABLE person_citizenships (
  ctz_id          NUMBER NOT NULL,
  ctz_prs_id      NUMBER NOT NULL,
  ctz_citizenship VARCHAR2(10) NOT NULL,
  ctz_start_date  DATE NOT NULL,
  ctz_end_date    DATE);

INSERT ALL
  WHEN 1=1 THEN
    INTO persons VALUES (
      rn, substr(object_name, 1, 20), substr(object_type, 1, 20) )
  -- all people have at least 1 address
  WHEN 1=1 THEN
    INTO person_addresses VALUES (
      rn, rn, substr(object_name, 1, 20), substr(object_name, 1, 20),
      substr(object_name, 1, 20), mod(rn, 15),
      sysdate - mod(rn, 300),
      decode(mod(rn, 3), 0, NULL, sysdate - mod(rn, 30)))
  -- 2/3rds people have 2 addresses
  WHEN mod(rn, 3) > 0 THEN
    INTO person_addresses VALUES (
      rn + 100000, rn, substr(object_name, 5, 19) || 'X',
      substr(object_name, 5, 19) || 'X', substr(object_name, 5, 19) || 'X',
      mod(rn, 5), sysdate - mod(rn, 30) + 1, NULL)
  -- all people have one record for birth
  WHEN 1=1 THEN
    INTO person_births VALUES (
      rn, rn, sysdate - 3000 - mod(rn,3000),
      sysdate - mod(rn, 300),
      decode(mod(rn, 3), 0, NULL, sysdate - mod(rn, 30)))
  -- 2/3rds people have 2 records for birth
  WHEN mod(rn, 3) > 0 THEN
    INTO person_births VALUES (
      rn + 100000, rn, sysdate - 2000 - mod(rn,3000),
      sysdate - mod(rn, 30) + 1, NULL)
  -- all people have at least one citizenship
  WHEN 1=1 THEN
    INTO person_citizenships VALUES (
      rn, rn, substr(object_name, 1, 10),
      sysdate - mod(rn, 300),
      decode(mod(rn, 3), 0, NULL, sysdate - mod(rn, 30)))
  -- 2/3rds people have 2 citizenships
  WHEN mod(rn, 3) > 0 THEN
    INTO person_citizenships VALUES (
      rn + 100000, rn, substr(object_name, 10, 9) || 'X',
      sysdate - mod(rn, 30) + 1, NULL)
SELECT rownum rn, all_objects.* FROM all_objects;

ALTER TABLE persons ADD CONSTRAINT prs_pk PRIMARY KEY (prs_id);
ALTER TABLE person_births ADD CONSTRAINT brt_pk PRIMARY KEY (brt_id);
ALTER TABLE person_addresses ADD CONSTRAINT adr_pk PRIMARY KEY (adr_id);
ALTER TABLE person_citizenships ADD CONSTRAINT ctz_pk PRIMARY KEY (ctz_id);

ALTER TABLE person_births ADD CONSTRAINT brt_prs_fk FOREIGN KEY (brt_prs_id) REFERENCES persons(prs_id);
ALTER TABLE person_addresses ADD CONSTRAINT adr_prs_fk FOREIGN KEY (adr_prs_id) REFERENCES persons(prs_id);
ALTER TABLE person_citizenships ADD CONSTRAINT ctz_prs_fk FOREIGN KEY (ctz_prs_id) REFERENCES persons(prs_id);

CREATE INDEX brt_prs_idx ON person_births (brt_prs_id);
CREATE INDEX adr_prs_idx ON person_addresses (adr_prs_id);
CREATE INDEX ctz_prs_idx ON person_citizenships (ctz_prs_id);

We have now created the basic environment and can start experimenting. Here is a snippet from the SQL*Plus screen showing what it takes to insert a single row into the person_citizenships table and what it takes to insert a bunch of rows (just to be sure there is no problem with the persons table, the set-based insert is done twice):

SQL> set timing on
SQL> INSERT INTO person_citizenships VALUES (0, 1, '1ROWINSERT', sysdate - 1000, sysdate);

1 row created.

Elapsed: 00:00:00.00
SQL> commit;

Commit complete.

Elapsed: 00:00:00.00
SQL> SELECT max(ctz_id) FROM person_citizenships;

MAX(CTZ_ID)
-----------
     160011

Elapsed: 00:00:00.00
SQL> INSERT INTO person_citizenships
  2  SELECT rownum + 160011, mod(rownum, 100) + 1, 'MANYROWS', sysdate - 1000, sysdate
  3  FROM persons;

60011 rows created.

Elapsed: 00:00:01.06
SQL> commit;

Commit complete.

Elapsed: 00:00:00.00
SQL> SELECT max(ctz_id) FROM person_citizenships;

MAX(CTZ_ID)
-----------
     220022

Elapsed: 00:00:00.00
SQL> INSERT INTO person_citizenships
  2  SELECT rownum + 220022, mod(rownum, 100) + 1, 'MANYROWS', sysdate - 1000, sysdate
  3  FROM persons;

60011 rows created.

Elapsed: 00:00:01.09
SQL> commit;

Commit complete.

Elapsed: 00:00:00.00

Now let's create the materialized view logs and the materialized view, and repeat the same inserts.

DROP MATERIALIZED VIEW LOG ON persons;
DROP MATERIALIZED VIEW LOG ON person_births;
DROP MATERIALIZED VIEW LOG ON person_addresses;
DROP MATERIALIZED VIEW LOG ON person_citizenships;

CREATE MATERIALIZED VIEW LOG ON persons WITH ROWID, PRIMARY KEY;
CREATE MATERIALIZED VIEW LOG ON person_births WITH ROWID, PRIMARY KEY;
CREATE MATERIALIZED VIEW LOG ON person_addresses WITH ROWID, PRIMARY KEY;
CREATE MATERIALIZED VIEW LOG ON person_citizenships WITH ROWID, PRIMARY KEY;

CREATE MATERIALIZED VIEW pers_full_info1
BUILD IMMEDIATE
REFRESH FAST ON COMMIT
ENABLE QUERY REWRITE
AS
SELECT persons.rowid prs_rowid, prs_id, prs_name, prs_surname,
       person_addresses.rowid adr_rowid, adr_id, adr_country, adr_city,
       adr_street, adr_house_numb, adr_start_date, adr_end_date,
       person_births.rowid brt_rowid, brt_id, brt_date, brt_start_date, brt_end_date,
       person_citizenships.rowid ctz_rowid, ctz_id, ctz_prs_id, ctz_citizenship,
       ctz_start_date, ctz_end_date
  FROM persons, person_births, person_addresses, person_citizenships
 WHERE prs_id = adr_prs_id
   AND prs_id = ctz_prs_id
   AND prs_id = brt_prs_id;

SQL> INSERT INTO person_citizenships VALUES (-1, 1, '1ROWINSERT', sysdate - 1000, sysdate);

1 row created.

Elapsed: 00:00:00.00
SQL> commit;

Commit complete.

Elapsed: 00:00:00.01
SQL> SELECT max(ctz_id) FROM person_citizenships;

MAX(CTZ_ID)
-----------
     280033

Elapsed: 00:00:00.00
SQL> INSERT INTO person_citizenships
  2  SELECT rownum + 280033, mod(rownum, 100) + 1, 'MANYROWS', sysdate - 1000, sysdate
  3  FROM persons;

60011 rows created.

Elapsed: 00:00:08.01
SQL> commit;

Commit complete.

Elapsed: 00:00:05.02

Oops, something interesting has happened: all the timings have risen. The following table summarizes the changes.

Measure             Timing without MV   Timing with MV
One row insert      0.00                0.00
One row commit      0.00                0.01
60011 rows insert   1.09                8.01
60011 rows commit   0.00                5.02

So extra work has to be done both for inserts and for commits. The amount of extra work can be understood by tracing the simple one-row insert (a sketch of how to produce such a trace yourself follows the measurement table below). The tkprof'ed trace file of a one-row insert together with its commit showed the following lines (remember, this was a 4-table join):

4 user SQL statements in trace file.
153 internal SQL statements in trace file.
157 SQL statements in trace file.
57 unique SQL statements in trace file.

The extra work, of course, depends on the number of joined tables. The following table summarizes measurements for 5-, 4-, 3- and 2-table joins and for a single table (a simple materialized view over it). In parentheses is the difference between the current measure and the join with one table less. I cannot explain why the "user SQL statements" differ and why the "unique SQL statements" fluctuate, because I didn't look through all the trace files, but I think the trend is clear – the more tables joined, the more SQL statements to execute.

Measure                  5 tables   4 tables   3 tables   2 tables   1 table
user SQL statements      5          4          4          5          6
internal SQL statements  197 (44)   153 (32)   121 (22)   99 (35)    64
SQL statements           202 (45)   157 (32)   125 (21)   104 (34)   70
unique SQL statements    71         57         55         63         57
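For reference, here is a minimal sketch of how such a trace can be produced. The ctz_id value -2 is just an arbitrary unused key, and the trace file name and user dump directory depend on your instance settings:

ALTER SESSION SET sql_trace = TRUE;
INSERT INTO person_citizenships VALUES (-2, 1, '1ROWINSERT', sysdate - 1000, sysdate);
COMMIT;
ALTER SESSION SET sql_trace = FALSE;
-- then, on the database server, format the raw trace file with tkprof:
-- tkprof <your_udump_dir>/<your_trace_file>.trc one_row_insert.txt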

Let's return to our story. Since refresh on commit was ruled out and only complete refresh remained, it was clear that the customer had to accept searching on stale data. Of course the original alternative – searches slowing down the whole system – was even worse. The next task was to ensure that a materialized view is always available for searching, and the only solution we found was to create two identical materialized views that are refreshed in rotation. The refresh process involves more than simply refreshing the two materialized views alternately; it is explained in the next chapter.

Alternate refresh process of materialized views

Imagine there are two identical materialized views MA and MB. The process to refresh MA is as follows (a minimal PL/SQL sketch is shown at the end of this chapter):
1. Statistics on MB and its indexes are set. Query rewrite happens only because the CBO, using statistics, calculates that the rewritten query costs less and consequently runs faster, so statistics on materialized views are required for query rewrite.4 Statistics for each materialized view and its indexes are calculated only once, stored in a separate table, and set and unset using the package dbms_stats.
2. Statistics on MA and its indexes are deleted. As we know from the previous step, statistics are necessary for query rewrite, so after this step every possible query is rewritten to MB.
3. There is a break of 10 minutes. This break is necessary to ensure that all queries that were possibly rewritten to MA have completed and MA can be refreshed without any problem.
4. All indexes on MA are dropped. If there is another query still running on MA ….
5. MA is refreshed using dbms_mview.refresh.
6. All indexes on MA are created. The three-step process 1) drop all indexes, 2) refresh the materialized view, 3) create all indexes is much faster than a simple refresh without dropping and recreating the indexes, because maintaining indexes during the refresh is expensive.
7. Statistics on MA and its indexes are set.
8. Statistics on MB and its indexes are deleted. From now on query rewrite will happen only for the just refreshed materialized view MA.

The refresh process described above has the following characteristics:
1) Because there are two identical materialized views, queries can always be rewritten to the most recently refreshed one.
2) Two identical materialized views of course require twice as much space as one. That is the price for an uninterrupted query rewrite process.
3) It requires as little work as possible, i.e. indexes are only dropped and recreated, and statistics are not recalculated because in reality they change very little over time.
4) A full scan of all base tables is necessary each time a materialized view is refreshed. That means the system should not already be overloaded, because the refresh process needs significant resources. In our project we had enough resources for one resource-hungry refresh process, but not enough for tens of slightly less hungry search processes.

4 How statistics affect query rewrite: http://download-east.oracle.com/docs/cd/B19306_01/server.102/b14223/qradv.htm#i1006458
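To make the eight steps above more tangible, here is a minimal PL/SQL sketch of one refresh cycle for MA. The materialized view names MA and MB, the single index ma_adr_country_idx and the statistics table MV_STATS (created beforehand with dbms_stats.create_stat_table) are assumptions for illustration, not the project's actual code:

BEGIN
  -- steps 1 and 2: make MB the query rewrite target
  dbms_stats.import_table_stats(ownname => user, tabname => 'MB',
                                stattab => 'MV_STATS', cascade => TRUE);
  dbms_stats.delete_table_stats(ownname => user, tabname => 'MA',
                                cascade_indexes => TRUE);
  -- step 3: let queries already rewritten to MA finish
  dbms_lock.sleep(600);                       -- requires execute on dbms_lock
  -- steps 4, 5 and 6: drop indexes, complete refresh, recreate indexes
  EXECUTE IMMEDIATE 'DROP INDEX ma_adr_country_idx';
  dbms_mview.refresh('MA', method => 'C');
  EXECUTE IMMEDIATE 'CREATE INDEX ma_adr_country_idx ON ma (adr_country)';
  -- steps 7 and 8: make MA the query rewrite target again
  dbms_stats.import_table_stats(ownname => user, tabname => 'MA',
                                stattab => 'MV_STATS', cascade => TRUE);
  dbms_stats.delete_table_stats(ownname => user, tabname => 'MB',
                                cascade_indexes => TRUE);
END;
/

Refreshing MB later is the same procedure with the roles of MA and MB swapped.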

The very search mechanism

The search mechanism we used was divided into two parts:
1. Finding person identifiers that match the search criteria, using materialized views.
2. Finding identification information about those persons in the actual tables, using the identifiers found in the first step.

The two-step process adds the following characteristics, which depending on your needs may be either benefits or shortcomings:
- The two-step process clearly separates the tables used for search criteria from the tables used only to get identifying information about a person. It is clear that databases are made for joins, and ideally you would join both the tables needed for searching and the tables needed for displaying the result. In our case it was a design decision: several different queries were used to get the resulting IDs, and then one common query fetched the data to display on the user's screen.
- One searches in slightly stale data, but only the most current information is shown. This avoids user confusion in the case where the user looks at the result set, drills down into the detailed information and then discovers it differs from what was shown in the search results. Of course you pay for that anyway, because now the user may discover that the data he found probably doesn't satisfy the search criteria. So confusion is possible in any case, and users should be warned about it up front.

So what should you watch out for in the first step? One has to remember at least the following things when creating materialized join views for searching:
- The set of joined tables should be minimal and satisfy your minimal needs. Suppose you have tables A, B and C joined in materialized view M. Assuming all other requirements are satisfied, you can rewrite a join of four tables A, B, C and D using M and D. But you cannot rewrite a join of only the two tables A and B.
- Child tables should be joined via outer join unless you are completely sure that each parent has at least one child in the particular table. Otherwise you will exclude valid data.
- Some criterion to eliminate historical data should be applied. Suppose you have the parent table persons, and on average each adult has 2 parents, 5 addresses, 1.5 marriages, 3 identification documents, 1.5 citizenships and 2 children. As a result your MV will contain 2*5*1.5*3*1.5*2 = 135 rows for each adult. Storing only the last address, the most recent identification document and the most recent marriage gets you back to a much more comfortable 2*1*1*1*1.5*2 = 6 rows.
- If you set any criteria on outer joins, then write the queries intended for query rewrite very carefully. It seems that Oracle can use only text-match rewrite for such queries, and even the smallest change in predicate order, even additional predicates that definitely do not change the result set, even additional comments, prevent query rewrite. Examples and a possible workaround to ensure query rewrite anyway are shown below.

So let's look at some examples. The first example illustrates the first bullet above. Here is a 3-table outer join materialized view and a 2-table join search query that we would like to be rewritten.
DROP MATERIALIZED VIEW pers_full_info1;

CREATE MATERIALIZED VIEW pers_full_info1
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
ENABLE QUERY REWRITE
AS
SELECT prs_id, prs_name, prs_surname,
       adr_id, adr_country, adr_city, adr_street, adr_house_numb,
       adr_start_date, adr_end_date,
       ctz_id, ctz_prs_id, ctz_citizenship, ctz_start_date, ctz_end_date
  FROM persons, person_addresses, person_citizenships
 WHERE prs_id = ctz_prs_id (+)
   AND prs_id = adr_prs_id (+);

exec dbms_stats.gather_table_stats(user, 'pers_full_info1', cascade => TRUE);
exec dbms_stats.gather_table_stats(user, 'persons', cascade => TRUE);
exec dbms_stats.gather_table_stats(user, 'person_citizenships', cascade => TRUE);
exec dbms_stats.gather_table_stats(user, 'person_addresses', cascade => TRUE);
exec dbms_stats.gather_table_stats(user, 'person_births', cascade => TRUE);

SQL> set autotrace traceonly explain
SQL> SELECT /*+ rewrite */
  2         prs_id,
  3         prs_name,
  4         prs_surname,
  5         adr_id,
  6         adr_country,
  7         adr_city,
  8         adr_street,
  9         adr_house_numb,
 10         adr_start_date,
 11         adr_end_date
 12    FROM persons, person_addresses
 13   WHERE prs_id = adr_prs_id (+)
 14     AND rownum < 2
 15  /

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=4 Card=1 Bytes=108)
   1    0   COUNT (STOPKEY)
   2    1     NESTED LOOPS (OUTER) (Cost=4 Card=1 Bytes=108)
   3    2       TABLE ACCESS (FULL) OF 'PERSONS' (Cost=2 Card=1 Bytes=31)
   4    2       TABLE ACCESS (BY INDEX ROWID) OF 'PERSON_ADDRESSES' (Cost=2 Card=1 Bytes=77)
   5    4         INDEX (RANGE SCAN) OF 'ADR_PRS_IDX' (NON-UNIQUE) (Cost=1 Card=1)

As you can see, there was no query rewrite even with the REWRITE hint which, according to Oracle documentation, "instructs the optimizer to rewrite a query in terms of materialized views, when possible, without cost consideration".5 And now the same situation with the same 3-table join materialized view, but with the 4th table joined as well.

5 See the REWRITE hint: http://download-east.oracle.com/docs/cd/B19306_01/server.102/b14200/sql_elements006.htm#SQLRF50503

SQL> SELECT
  2         prs_id,
  3         prs_name,
  4         prs_surname,
  5         adr_country,
  6         ctz_citizenship,
  7         brt_date
  8    FROM persons, person_addresses, person_citizenships, person_births
  9   WHERE prs_id = ctz_prs_id (+)
 10     AND prs_id = adr_prs_id (+)
 11     AND prs_id = brt_prs_id (+)
 12     AND rownum < 2
 13  /

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=4 Card=1 Bytes=84)
   1    0   COUNT (STOPKEY)
   2    1     NESTED LOOPS (OUTER) (Cost=4 Card=1 Bytes=84)
   3    2       TABLE ACCESS (FULL) OF 'PERS_FULL_INFO1' (Cost=2 Card=1 Bytes=62)
   4    2       TABLE ACCESS (BY INDEX ROWID) OF 'PERSON_BIRTHS' (Cost=2 Card=1 Bytes=22)
   5    4         INDEX (RANGE SCAN) OF 'BRT_PRS_IDX' (NON-UNIQUE) (Cost=1 Card=1)

So for a 4-table join you can reuse the existing MV and additionally join only the 4th table, without any hints, simply because it costs less than the plan without query rewrite. The cost without query rewrite can be checked by adding the NOREWRITE hint; in my case it was 8.

The next example shows the elimination of historical rows. Remember, in the chapter "Bright idea – materialized views" I generated more than one record in each of the tables person_addresses, person_citizenships and person_births, but only the last record describes the current situation, i.e. the one where end_date IS NULL. At first let's look at how many rows we get for each type of query:

SQL> set autotrace off
SQL> SELECT count(*)
  2    FROM persons, person_citizenships, person_births, person_addresses
  3   WHERE prs_id = adr_prs_id (+)
  4     AND prs_id = brt_prs_id (+)
  5     AND prs_id = ctz_prs_id (+);

  COUNT(*)
----------
    340187

SQL> SELECT count(*)
  2    FROM persons, person_citizenships, person_births, person_addresses
  3   WHERE prs_id = adr_prs_id (+)
  4     AND prs_id = brt_prs_id (+)
  5     AND prs_id = ctz_prs_id (+)
  6     AND adr_end_date (+) IS NULL
  7     AND brt_end_date (+) IS NULL
  8     AND ctz_end_date (+) IS NULL;

  COUNT(*)
----------
     60033

SQL> SELECT count(*) FROM persons;

  COUNT(*)
----------
     60033

60033 is the same number for persons alone and for the join of all 4 tables; without the restrictions the number is much higher. Now we'll create the final materialized view that will be used for searching, including only columns without duplicates (e.g. only prs_id and not adr_prs_id as well) and without columns whose values are explicitly known (e.g. the end_date columns).

DROP MATERIALIZED VIEW pers_full_info1;

CREATE MATERIALIZED VIEW pers_full_info1
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
ENABLE QUERY REWRITE
AS
SELECT prs_id, prs_name, prs_surname,
       adr_id, adr_country, adr_city, adr_street, adr_house_numb, adr_start_date,
       ctz_id, ctz_citizenship, ctz_start_date,
       brt_id, brt_date, brt_start_date
  FROM persons, person_addresses, person_citizenships, person_births
 WHERE prs_id = adr_prs_id (+)
   AND prs_id = brt_prs_id (+)
   AND prs_id = ctz_prs_id (+)
   AND adr_end_date (+) IS NULL
   AND brt_end_date (+) IS NULL
   AND ctz_end_date (+) IS NULL;

exec dbms_stats.gather_table_stats(user, 'pers_full_info1', cascade => TRUE);
exec dbms_stats.gather_table_stats(user, 'persons', cascade => TRUE);
exec dbms_stats.gather_table_stats(user, 'person_citizenships', cascade => TRUE);
exec dbms_stats.gather_table_stats(user, 'person_addresses', cascade => TRUE);
exec dbms_stats.gather_table_stats(user, 'person_births', cascade => TRUE);

Now let's see why I warned so strongly about using exactly the same query text.6 The first example uses the same query text starting from the FROM clause and succeeds in query rewrite.

6 More about the types of query rewrite in the Oracle docs: http://download-east.oracle.com/docs/cd/B19306_01/server.102/b14223/qradv.htm#BABFDGFG

SQL> set autotrace traceonly explain
SQL> SELECT count(*)
  2    FROM persons, person_addresses, person_citizenships, person_births
  3   WHERE prs_id = adr_prs_id (+)
  4     AND prs_id = brt_prs_id (+)
  5     AND prs_id = ctz_prs_id (+)
  6     AND adr_end_date (+) IS NULL
  7     AND brt_end_date (+) IS NULL
  8     AND ctz_end_date (+) IS NULL
  9  /

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=187 Card=1)
   1    0   SORT (AGGREGATE)
   2    1     TABLE ACCESS (FULL) OF 'PERS_FULL_INFO1' (Cost=187 Card=99487)

The second example just adds an absolutely innocent comment.

SQL> SELECT count(*)
  2    FROM persons, person_addresses, person_citizenships, person_births
  3   WHERE prs_id = adr_prs_id (+)
  4     AND prs_id = brt_prs_id (+)
  5     AND prs_id = ctz_prs_id (+)
  6     AND adr_end_date (+) IS NULL -- some absolutely innocent comments
  7     AND brt_end_date (+) IS NULL
  8     AND ctz_end_date (+) IS NULL
  9  /

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=504 Card=1 Bytes=45)
   1    0   SORT (AGGREGATE)
   2    1     HASH JOIN (OUTER) (Cost=504 Card=60033 Bytes=2701485)
   3    2       HASH JOIN (OUTER) (Cost=370 Card=60033 Bytes=1380759)
   4    3         HASH JOIN (OUTER) (Cost=249 Card=60033 Bytes=840462)
   5    4           TABLE ACCESS (FULL) OF 'PERSONS' (Cost=48 Card=60033 Bytes=300165)
   6    4           TABLE ACCESS (FULL) OF 'PERSON_ADDRESSES' (Cost=173 Card=60033 Bytes=540297)
   7    3         TABLE ACCESS (FULL) OF 'PERSON_CITIZENSHIPS' (Cost=77 Card=60033 Bytes=540297)
   8    2       TABLE ACCESS (FULL) OF 'PERSON_BIRTHS' (Cost=77 Card=47664 Bytes=1048608)

Bahhhh! It's all over!7 No more query rewrite! OK, you can probably live without comments – you can place them before the query, for example – but what to do with additional predicates? Even predicates like rownum < X or 1 = 1 will prevent successful query rewrite.

SQL> SELECT count(*)
  2    FROM persons, person_addresses, person_citizenships, person_births
  3   WHERE prs_id = adr_prs_id (+)
  4     AND prs_id = brt_prs_id (+)
  5     AND prs_id = ctz_prs_id (+)
  6     AND adr_end_date (+) IS NULL
  7     AND brt_end_date (+) IS NULL
  8     AND ctz_end_date (+) IS NULL
  9     AND 1 = 1
 10  /

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=504 Card=1 Bytes=45)
   1    0   SORT (AGGREGATE)
   2    1     HASH JOIN (OUTER) (Cost=504 Card=60033 Bytes=2701485)
   3    2       HASH JOIN (OUTER) (Cost=370 Card=60033 Bytes=1380759)
   4    3         HASH JOIN (OUTER) (Cost=249 Card=60033 Bytes=840462)
   5    4           TABLE ACCESS (FULL) OF 'PERSONS' (Cost=48 Card=60033 Bytes=300165)
   6    4           TABLE ACCESS (FULL) OF 'PERSON_ADDRESSES' (Cost=173 Card=60033 Bytes=540297)
   7    3         TABLE ACCESS (FULL) OF 'PERSON_CITIZENSHIPS' (Cost=77 Card=60033 Bytes=540297)
   8    2       TABLE ACCESS (FULL) OF 'PERSON_BIRTHS' (Cost=77 Card=47664 Bytes=1048608)

7 In 10.2 one can use comments. All other mentioned restrictions still apply.

The solution is rather simple – just put the original query into a subquery and apply the necessary additional predicates in the outer query. So let's see how we can get all persons living at an address whose country looks like 'USER%'. We'll also create an appropriate index:

CREATE INDEX prs_full_idx1 ON pers_full_info1 (adr_country);
exec dbms_stats.gather_table_stats(user, 'pers_full_info1', cascade => TRUE);

SQL> SELECT prs_id FROM (
  2  SELECT prs_id, adr_country
  3    FROM persons, person_addresses, person_citizenships, person_births
  4   WHERE prs_id = adr_prs_id (+)
  5     AND prs_id = brt_prs_id (+)
  6     AND prs_id = ctz_prs_id (+)
  7     AND adr_end_date (+) IS NULL
  8     AND brt_end_date (+) IS NULL
  9     AND ctz_end_date (+) IS NULL
 10  )
 11  WHERE adr_country LIKE 'USER%'
 12  /

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=3 Card=1 Bytes=22)
   1    0   TABLE ACCESS (BY INDEX ROWID) OF 'PERS_FULL_INFO1' (Cost=3 Card=1 Bytes=22)
   2    1     INDEX (RANGE SCAN) OF 'PRS_FULL_IDX1' (NON-UNIQUE) (Cost=2 Card=2)

So we can see that the additional predicate has been merged into the subquery and the query can even use the just-created index! Of course the next question is – what is the gain of using the materialized view instead of simply joining all the tables? Creating a similar index on the adr_country column of the table person_addresses and running the same select with the NOREWRITE hint, I got 1652 consistent gets versus 189 consistent gets using the materialized view, i.e. more than 8 times fewer consistent gets with the materialized view.

The possible skeleton for the final search query is as follows:

SELECT [DISTINCT] <columns to return>
FROM ( <query that was the basis for the materialized view> )
WHERE <additional search conditions>;

- [DISTINCT] – depends on your needs, i.e. whether you can tolerate duplicates or not. If you think you cannot get duplicates from your materialized view query – better check once more whether that is really so.
- <columns to return> – either only the IDs, or some other descriptive information if you can get it out of the materialized view.
- <query that was the basis for the materialized view> – the defining query of the materialized view, starting with SELECT and ending with the last predicate, without any modifications except whitespace and new lines. Comments in this context are also modifications.
- <additional search conditions> – the actual search conditions you'd like to apply.

As a result, tuning search queries becomes much easier, i.e. it comes down to creating appropriate (composite) indexes on just one table – the materialized view.

Answering the initial query

The initial query described in the chapter "Sizing up the situation" has of course not lost its ugliness. Even with materialized views and query rewrite, in some cases the query plan will be inefficient. But the big difference is that you no longer need never-ending nested loops from table A to table B and then to table C and … finally to table Z, only to find out that the initially picked row does not satisfy the necessary criteria. Everything you need is in one table (the materialized view) and you can create the necessary (composite) indexes for the most common queries. Also, queries that return a big result set and/or have no effective index access path, i.e. queries that need a full table scan, can now scan only one table instead of scanning n tables and performing n-1 (most probably hash) joins.
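As an illustration only, here is a rough sketch of how the introduction's search could be expressed against the example materialized view. The father's-name criterion is omitted because the sample schema has no parents table, the city value and date range are made up, and whether the rewrite actually kicks in should always be verified with autotrace as shown above:

SELECT prs_id
  FROM ( SELECT prs_id, prs_surname, brt_date, adr_city
           FROM persons, person_addresses, person_citizenships, person_births
          WHERE prs_id = adr_prs_id (+)
            AND prs_id = brt_prs_id (+)
            AND prs_id = ctz_prs_id (+)
            AND adr_end_date (+) IS NULL
            AND brt_end_date (+) IS NULL
            AND ctz_end_date (+) IS NULL )
 WHERE prs_surname LIKE '%a%a%a%'                         -- at least three characters "a"
   AND brt_date BETWEEN TO_DATE('1960.01.01', 'YYYY.MM.DD')
                    AND TO_DATE('1970.12.31', 'YYYY.MM.DD')
   AND adr_city = 'Riga';                                 -- "a big city"

A composite index on pers_full_info1 (adr_city, brt_date), for example, could then serve the most common variants of this search.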

Privileges and settings necessary to use query rewrite on your own objects

CREATE MATERIALIZED VIEW – this system privilege is not part of the standard RESOURCE role, so it has to be granted explicitly before you can create a materialized view.
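For example, using the db_owner schema from the example later in this paper, a DBA would issue:

GRANT CREATE MATERIALIZED VIEW TO db_owner;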

QUERY_REWRITE_ENABLED – a parameter you can set in the init file, with ALTER SYSTEM or with ALTER SESSION. In 9.2.0 and below the default value is FALSE, which is probably the most common reason why a query is not rewritten.

QUERY_REWRITE_INTEGRITY – a parameter you can set in the init file, with ALTER SYSTEM or with ALTER SESSION. The default value is ENFORCED, which is safe and usually what you want, but for search queries against periodically refreshed materialized views you need STALE_TOLERATED. STALE_TOLERATED allows query rewrite even if the underlying detail data has changed, and that is exactly what is needed in this case. Of course it is dangerous to set this value at system level, so you should set it either at session level (if it cannot possibly harm other statements run by the same session) or switch to STALE_TOLERATED just before the search query and back to ENFORCED immediately after it.
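A minimal sketch of the session-level approach, together with the QUERY_REWRITE_ENABLED setting from the previous paragraph (the search query itself is elided):

ALTER SESSION SET query_rewrite_enabled = TRUE;
ALTER SESSION SET query_rewrite_integrity = stale_tolerated;
-- ... run the search query here ...
ALTER SESSION SET query_rewrite_integrity = enforced;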

Statistics – statistics should be gathered on both the base tables and the materialized views, because normally (i.e. without the REWRITE hint) the final cost decides whether query rewrite is used or not.

CBO – the cost based optimizer must be used. Without the CBO, statistics don't matter and query rewrite isn't possible. So OPTIMIZER_MODE should be one of ALL_ROWS, FIRST_ROWS, FIRST_ROWS_n or CHOOSE.

Privileges necessary for other users to use query rewrite

It is common practice to separate the owner of the data from the owner of the application code. Suppose you have two users: db_owner owning all the data, and app_owner owning all procedural units and private synonyms to db_owner's tables. What additional privileges, not mentioned in the previous chapter, does app_owner need to successfully rewrite queries against db_owner's tables?

GLOBAL QUERY REWRITE – this privilege is needed if you want to create a materialized view that references tables in another schema, for example a materialized view X in the app_owner schema that references tables in the db_owner schema.
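For example (issued by a DBA or another suitably privileged user):

GRANT GLOBAL QUERY REWRITE TO app_owner;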

Qualify table names with the schema name – this of course is neither a privilege nor a parameter, but it is critical if you want queries issued in schema A against schema B's tables to be rewritten to schema B's materialized views. So in the db_owner schema we'll create the tables mentioned in the chapter "Bright idea – materialized views", create two materialized views, and grant select, insert, update and delete on both the underlying tables and the materialized views to app_owner.

DROP MATERIALIZED VIEW pers_full_info1;
DROP MATERIALIZED VIEW pers_full_info2;

CREATE MATERIALIZED VIEW pers_full_info1
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
ENABLE QUERY REWRITE
AS
SELECT prs_id, prs_name, prs_surname,
       adr_id, adr_country, adr_city, adr_street, adr_house_numb, adr_start_date,
       ctz_id, ctz_citizenship, ctz_start_date,
       brt_id, brt_date, brt_start_date
  FROM persons, person_addresses, person_citizenships, person_births
 WHERE prs_id = adr_prs_id (+)
   AND prs_id = brt_prs_id (+)
   AND prs_id = ctz_prs_id (+)
   AND adr_end_date (+) IS NULL
   AND brt_end_date (+) IS NULL
   AND ctz_end_date (+) IS NULL;

CREATE MATERIALIZED VIEW pers_full_info2
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
ENABLE QUERY REWRITE
AS
SELECT prs_id, prs_name, prs_surname,
       adr_id, adr_country, adr_city, adr_street, adr_house_numb, adr_start_date,
       ctz_id, ctz_citizenship, ctz_start_date,
       brt_id, brt_date, brt_start_date
  FROM db_owner.persons, db_owner.person_addresses,
       db_owner.person_citizenships, db_owner.person_births
 WHERE prs_id = adr_prs_id (+)
   AND prs_id = brt_prs_id (+)
   AND prs_id = ctz_prs_id (+)
   AND adr_end_date (+) IS NULL
   AND brt_end_date (+) IS NULL
   AND ctz_end_date (+) IS NULL;

exec dbms_stats.gather_table_stats(user, 'pers_full_info1', cascade => TRUE);
exec dbms_stats.gather_table_stats(user, 'pers_full_info2', cascade => TRUE);

GRANT SELECT, INSERT, UPDATE, DELETE ON persons TO app_owner;
GRANT SELECT, INSERT, UPDATE, DELETE ON person_citizenships TO app_owner;
GRANT SELECT, INSERT, UPDATE, DELETE ON person_births TO app_owner;
GRANT SELECT, INSERT, UPDATE, DELETE ON person_addresses TO app_owner;
GRANT SELECT, INSERT, UPDATE, DELETE ON pers_full_info1 TO app_owner;
GRANT SELECT, INSERT, UPDATE, DELETE ON pers_full_info2 TO app_owner;

Now let's see how both queries are rewritten:

SQL> SET AUTOTRACE TRACEONLY EXPLAIN
SQL> SELECT count(*)
  2    FROM persons, person_addresses, person_citizenships, person_births
  3   WHERE prs_id = adr_prs_id (+)
  4     AND prs_id = brt_prs_id (+)
  5     AND prs_id = ctz_prs_id (+)
  6     AND adr_end_date (+) IS NULL
  7     AND brt_end_date (+) IS NULL
  8     AND ctz_end_date (+) IS NULL;

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=85 Card=1)
   1    0   SORT (AGGREGATE)
   2    1     TABLE ACCESS (FULL) OF 'PERS_FULL_INFO1' (Cost=85 Card=24982)

SQL> SELECT count(*)
  2    FROM db_owner.persons, db_owner.person_addresses,
  3         db_owner.person_citizenships, db_owner.person_births
  4   WHERE prs_id = adr_prs_id (+)
  5     AND prs_id = brt_prs_id (+)
  6     AND prs_id = ctz_prs_id (+)
  7     AND adr_end_date (+) IS NULL
  8     AND brt_end_date (+) IS NULL
  9     AND ctz_end_date (+) IS NULL;

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=85 Card=1)
   1    0   SORT (AGGREGATE)
   2    1     TABLE ACCESS (FULL) OF 'PERS_FULL_INFO2' (Cost=85 Card=24982)

So now let's connect to the app_owner schema, create private synonyms for the tables and materialized views, and try query rewrite again.

conn app_owner/app_owner@XXX
CREATE OR REPLACE SYNONYM persons FOR db_owner.persons;
CREATE OR REPLACE SYNONYM person_citizenships FOR db_owner.person_citizenships;
CREATE OR REPLACE SYNONYM person_births FOR db_owner.person_births;
CREATE OR REPLACE SYNONYM person_addresses FOR db_owner.person_addresses;
CREATE OR REPLACE SYNONYM pers_full_info1 FOR db_owner.pers_full_info1;
CREATE OR REPLACE SYNONYM pers_full_info2 FOR db_owner.pers_full_info2;

SQL> set autotrace traceonly explain
SQL> SELECT count(*)
  2    FROM persons, person_addresses, person_citizenships, person_births
  3   WHERE prs_id = adr_prs_id (+)
  4     AND prs_id = brt_prs_id (+)
  5     AND prs_id = ctz_prs_id (+)
  6     AND adr_end_date (+) IS NULL
  7     AND brt_end_date (+) IS NULL
  8     AND ctz_end_date (+) IS NULL;

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=180 Card=1 Bytes=45)
   1    0   SORT (AGGREGATE)
   2    1     HASH JOIN (OUTER) (Cost=180 Card=24982 Bytes=1124190)
   3    2       HASH JOIN (OUTER) (Cost=140 Card=24982 Bytes=574586)
   4    3         HASH JOIN (OUTER) (Cost=95 Card=24982 Bytes=349748)
   5    4           INDEX (FAST FULL SCAN) OF 'PRS_PK' (UNIQUE) (Cost=9 Card=24982 Bytes=124910)
   6    4           TABLE ACCESS (FULL) OF 'PERSON_ADDRESSES' (Cost=79 Card=24982 Bytes=224838)
   7    3         TABLE ACCESS (FULL) OF 'PERSON_CITIZENSHIPS' (Cost=34 Card=24982 Bytes=224838)
   8    2       TABLE ACCESS (FULL) OF 'PERSON_BIRTHS' (Cost=32 Card=805 Bytes=17710)

SQL> SELECT count(*)
  2    FROM db_owner.persons, db_owner.person_addresses,
  3         db_owner.person_citizenships, db_owner.person_births
  4   WHERE prs_id = adr_prs_id (+)
  5     AND prs_id = brt_prs_id (+)
  6     AND prs_id = ctz_prs_id (+)
  7     AND adr_end_date (+) IS NULL
  8     AND brt_end_date (+) IS NULL
  9     AND ctz_end_date (+) IS NULL;

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=85 Card=1)
   1    0   SORT (AGGREGATE)
   2    1     TABLE ACCESS (FULL) OF 'PERS_FULL_INFO2' (Cost=85 Card=24982)

It is obvious that without explicitly qualified schema names there was no query rewrite.

Ensuring query rewrite takes place

There are at least two options for verifying that a query is rewritten. The first and easier one is to use autotrace in SQL*Plus, as shown in the chapter "The very search mechanism". A complete reference for autotrace can be found in the SQL*Plus® User's Guide and Reference http://download-east.oracle.com/docs/cd/B19306_01/server.102/b14357/ch8.htm#i1037182.

The second option is to use SQL trace.
- Some basic information is in the Oracle® Database Performance Tuning Guide http://download-east.oracle.com/docs/cd/B19306_01/server.102/b14211/sqltrace.htm#i4640.
- If you need a comprehensive trace guide, then the best choice to my mind is Optimizing Oracle Performance by Cary Millsap and Jeff Holt.
- If you already know the trace concepts and just tend to forget the necessary statements, you can use my own quick reference at http://www.gplivna.eu/papers/otrace.doc.

Measuring how often query rewrite takes place

After you have successfully implemented and tested searching via materialized views, the next question arises – how many times were queries rewritten and what is the overall gain? The best and most accurate method, of course, is your own code instrumentation. If you have instrumented your code, you can count how many times you have executed the queries you know are being rewritten, derive the saved seconds and logical reads from that, and translate them into saved $$. Without proper code instrumentation the situation is not so promising: as far as I know there is no way to directly count all rewritten queries.8 Of course there are a few indirect and rather inaccurate ways to do it (sketched after this list):
- Scan v$sql_plan and search for operations like 'MAT_VIEW REWRITE%' in version 10.2.0.1. In version 9.2.0.5 the operation is just 'TABLE ACCESS', so you cannot distinguish it from access to ordinary tables. In both versions object_name is, of course, the materialized view's name.
- Watch the statistics in the v$segment_statistics view (or v$segstat, which is less user friendly but also less intrusive) for the materialized views and their indexes. You would mostly focus on logical reads. Of course the materialized view refresh process itself increments the statistics, so you need to record the values just after a refresh has finished and again just before the next refresh starts. You can easily extend the refresh process described in the chapter "Alternate refresh process of materialized views" with statistics-recording steps at the very beginning and the very end. You should be aware of two things though: at least on my 9.2, all statistics were zeroed by the materialized view refresh, i.e. after the refresh they were constant for the same data irrespective of previous values. And secondly, the logical reads statistic is sampled, as can be seen in v$segstat_name9, so it is not very accurate and sometimes may not even show up in v$segment_statistics and v$segstat.
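A sketch of both indirect checks; the view and column names are standard, but on 9.2 the first query will find nothing because the plan operation there is plain TABLE ACCESS (and v$sql_plan.sql_id exists only from 10g onwards):

-- 10.2: cursors currently in the shared pool whose plans use MV rewrite
SELECT object_owner, object_name, COUNT(DISTINCT sql_id) AS cursors
  FROM v$sql_plan
 WHERE operation LIKE 'MAT_VIEW REWRITE%'
 GROUP BY object_owner, object_name;

-- logical reads accumulated against the two materialized views
SELECT object_name, statistic_name, value
  FROM v$segment_statistics
 WHERE object_name IN ('PERS_FULL_INFO1', 'PERS_FULL_INFO2')
   AND statistic_name = 'logical reads';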

8 See for example Metalink note 264473.1. Thanks to Ghassan Salem on the oracle-l list for pointing it out.
9 Look for example at bug 3771746 on Metalink, which actually is not a bug ;)

Further reading

1. Oracle® Database Data Warehousing Guide 10g Release 2 (10.2). Chapters 8 Basic Materialized Views and 9 Advanced Materialized Views. http://download-east.oracle.com/docs/cd/B19306_01/server.102/b14223/toc.htm

2. Oracle® Database Data Warehousing Guide 10g Release 2 (10.2). Chapters 17 Basic Query Rewrite and 18 Advanced Query Rewrite. http://download-east.oracle.com/docs/cd/B19306_01/server.102/b14223/toc.htm

3. Oracle® Database SQL Reference 10g Release 2 (10.2). Statements CREATE MATERIALIZED VIEW and CREATE MATERIALIZED VIEW LOG. http://download-east.oracle.com/docs/cd/B19306_01/server.102/b14200/toc.htm

4. Materialised Views, Written by Howard J. Rogers. Description of dbms_advisor package and some other 10g new features. http://dizwell.com/main/content/view/23/40/

5. Room with a Better View, By Arup Nanda. Description of dbms_mview package. http://www.oracle.com/technology/oramag/oracle/05-mar/o25data.html

6. Oracle Database 10g: Top Features for DBAs, Release 2 Features Addendum, Oracle ACE Arup Nanda presents his list of the top new Oracle Database 10g Release 2 features for database administrators. Query rewrite possible with more than one materialized view.10 http://www.oracle.com/technology/pub/articles/10gdba/nanda_10gr2dba_part4.html

Summary

Effective searching is one of the most common challenges for normalized databases. This article gave a detailed description of how to solve this problem using materialized views, even in a very changeable environment. Of course, almost nothing in this life comes for free and this case is no exception – one has to search in more or less stale data.

About the author

Gints Plivna ([email protected]) is a system analyst at Rix Technologies Ltd. (www.rixtech.lv). He has been working with Oracle since 1997 and his interests have mostly been connected with analyzing system requirements, design, development and SQL tuning.

Contacts: e-mail - [email protected] website - http://www.gplivna.eu/

10 Although that doesn't work well with join-only queries, and especially with outer join queries.

Licenses

This work is licensed under the Creative Commons Attribution-ShareAlike 2.5 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/2.5/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.