Disjunction (logical OR) is known to cause various performance problems, from extreme parse times (e.g. here) to sub-optimal plans. A common solution to such problems is getting rid of ORs via "OR expansion" (i.e. a rewrite using UNION ALL), although it doesn't work in 100% of cases. In this post, I will consider an example of an OR problem that can be solved differently.
Here is the setup:
create table t(
  id    number,
  x     number,
  date1 date not null,
  date2 date not null,
  date3 date not null,
  date4 date not null,
  date5 date not null
);

insert into t
select level,
       dbms_random.value,
       sysdate - 1000*dbms_random.value,
       sysdate - 1000*dbms_random.value,
       sysdate - 1000*dbms_random.value,
       sysdate - 1000*dbms_random.value,
       sysdate - 1000*dbms_random.value
from dual
connect by level <= 1e6;

commit;

exec dbms_stats.gather_table_stats(user, 'T');
So we have a table with a bunch of date columns, and we want to be able to efficiently run queries like this one:
select count(*)
from t
where date1 > sysdate - 1
   or date2 > sysdate - 1
   or date3 > sysdate - 1
   or date4 > sysdate - 1
   or date5 > sysdate - 1;
What are the options here? One obvious solution is a full table scan, but since we are only selecting a small percentage of the rows, it would be very inefficient:
  COUNT(*)
----------
      4712

Execution Plan
----------------------------------------------------------
Plan hash value: 1842905362

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    40 | 15623  (69)| 00:00:05 |
|   1 |  SORT AGGREGATE    |      |     1 |    40 |            |          |
|*  2 |   TABLE ACCESS FULL| T    |  4737 |   185K| 15623  (69)| 00:00:05 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DATE3">SYSDATE@!-1 OR "DATE1">SYSDATE@!-1 OR
              "DATE4">SYSDATE@!-1 OR "DATE5">SYSDATE@!-1 OR "DATE2">SYSDATE@!-1)

Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
      20004  consistent gets
      19997  physical reads
          0  redo size
        527  bytes sent via SQL*Net to client
        524  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed
That’s a lot of work to retrieve 0.5% of the rows of a 1M-row table (over 4 I/Os per row counted… yeuch…).
What if we create a concatenated index on all date columns? Let’s try that.
create index i$t$dates1 on t(date1, date2, date3, date4, date5);

exec dbms_stats.gather_table_stats(user, 'T');

Execution Plan
----------------------------------------------------------
Plan hash value: 338327390

------------------------------------------------------------------------------------
| Id  | Operation             | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |            |     1 |    40 | 13990  (74)| 00:00:04 |
|   1 |  SORT AGGREGATE       |            |     1 |    40 |            |          |
|*  2 |   INDEX FAST FULL SCAN| I$T$DATES1 |  4723 |   184K| 13990  (74)| 00:00:04 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DATE3">SYSDATE@!-1 OR "DATE1">SYSDATE@!-1 OR
              "DATE4">SYSDATE@!-1 OR "DATE5">SYSDATE@!-1 OR "DATE2">SYSDATE@!-1)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
      14787  consistent gets
          0  physical reads
          0  redo size
        527  bytes sent via SQL*Net to client
        524  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed
We got an INDEX FAST FULL SCAN instead of a TABLE ACCESS FULL, which is an improvement, but not really a big one. In terms of logical I/O we only gained 25% or so.
What about less obvious options? What if we rewrite the query as:
select count(*)
from t
where greatest(date1, date2, date3, date4, date5) > sysdate - 1;
Note that I declared the date columns as NOT NULL to ensure that this rewrite won’t break the logical equivalence, but the same goal could also be achieved with NVL if the columns are in fact nullable.
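For illustration, here is a minimal sketch of the NVL variant, assuming the columns were nullable; the fallback literal DATE '0001-01-01' and the index name i$t$dates2_nvl are illustrative placeholders, not something from the original example:

-- A sketch only: replace each NULL with a low "impossible" date so that
-- GREATEST never returns NULL and the predicate stays logically equivalent
-- to the original OR condition (a NULL column never satisfies either form).
select count(*)
from t
where greatest(nvl(date1, date '0001-01-01'),
               nvl(date2, date '0001-01-01'),
               nvl(date3, date '0001-01-01'),
               nvl(date4, date '0001-01-01'),
               nvl(date5, date '0001-01-01')) > sysdate - 1;

-- The matching function-based index has to use exactly the same expression:
create index i$t$dates2_nvl on t(greatest(nvl(date1, date '0001-01-01'),
                                          nvl(date2, date '0001-01-01'),
                                          nvl(date3, date '0001-01-01'),
                                          nvl(date4, date '0001-01-01'),
                                          nvl(date5, date '0001-01-01')));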
Have we gained anything, except for making the query slightly more compact? Actually, we did, because now we can create a function-based index to help us:
create index i$t$dates2 on t(greatest(date1, date2, date3, date4, date5));

exec dbms_stats.gather_table_stats(user, 'T');
Now execution stats improve considerably:
Execution Plan
----------------------------------------------------------
Plan hash value: 2843130627

--------------------------------------------------------------------------------
| Id  | Operation         | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |            |     1 |     8 |    38  (14)| 00:00:01 |
|   1 |  SORT AGGREGATE   |            |     1 |     8 |            |          |
|*  2 |   INDEX RANGE SCAN| I$T$DATES2 |  5439 | 43512 |    38  (14)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access(GREATEST("DATE1","DATE2","DATE3","DATE4","DATE5")>SYSDATE@!-1)

Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
         31  consistent gets
          0  physical reads
          0  redo size
        527  bytes sent via SQL*Net to client
        524  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed
Now that is real improvement: logical I/O is down by almost 3 orders of magnitude!
Note that the index itself is also smaller than the concatenated one, so we have the additional benefit of saving some space.
Funny, Nikolay, for me your query returned 4812 rather than 4712 …
Taking this opportunity, I would like to thank you for your wonderful blog.
Sergey
Russia, Vologda.
Hi Sergey, thanks for visiting my blog!
It’s not really surprising that you got a slightly different number of rows here, because there is a sysdate in the query (and if you re-run the query now, you’ll see yet another number…).
I know it’s somewhat confusing, but it would have taken me much longer to come up with a more deterministic example, so I decided that it didn’t really matter.
Best regards,
Nikolay
Hi Nikolay!
Sometimes it is enough to use a “union all” solution, because even in your example it works quite well:
we can create 5 indexes, one index per column:
disj.sql (embedded gist, content not shown)
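(The embedded gist is not reproduced here, so below is a rough sketch of the idea described in the comment: one single-column index per date column plus an OR-expansion rewrite via UNION ALL. The index names and the exact shape of the rewrite are assumptions, not the contents of disj.sql.)

-- A sketch only: one index per date column ...
create index i$t$date1 on t(date1);
create index i$t$date2 on t(date2);
create index i$t$date3 on t(date3);
create index i$t$date4 on t(date4);
create index i$t$date5 on t(date5);

-- ... and the OR rewritten as UNION ALL, each branch excluding rows already
-- counted by the previous branches so that nothing is counted twice
-- (the columns are NOT NULL here, so the NOT (...) predicates are safe).
select count(*)
from (
  select id from t where date1 > sysdate - 1
  union all
  select id from t where date2 > sysdate - 1
     and not (date1 > sysdate - 1)
  union all
  select id from t where date3 > sysdate - 1
     and not (date1 > sysdate - 1 or date2 > sysdate - 1)
  union all
  select id from t where date4 > sysdate - 1
     and not (date1 > sysdate - 1 or date2 > sysdate - 1 or date3 > sysdate - 1)
  union all
  select id from t where date5 > sysdate - 1
     and not (date1 > sysdate - 1 or date2 > sysdate - 1 or date3 > sysdate - 1 or date4 > sysdate - 1)
);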
In that case we don’t have an FBI, we can still use these indexes for other queries, and the result is not so bad:
https://gist.github.com/xtender/b7635b0621ded18274c1
Regards,
Sayan Malakshinov
Sorry, right link for spool:
disj_spool.sql (embedded gist, content not shown)
Sayan,
thanks for your comment.
1) I did mention that OR expansion is often a solution to OR problems, but I don’t think that it’s a very good one in this case
2) just imagine how much uglier and more difficult to support the code would become with 4 UNION ALLs in place. Please bear in mind that the code in my example is just a simplified representation of the actual problem; the real-life prototype was several screens long. Now imagine adding the UNION ALL branches on top of that: how would you feel if you had to support code like that?
3) I suspect that updating 5 indexes with DML would be more expensive than updating 1 FBI
4) I suspect that the 5 indexes would occupy more space than the FBI. I think even more than the table itself.
So to sum up, while this might be a working solution, it’s far from being practical, let alone optimal.
Nikolay,
1. I noticed your remark about “OR expansion”; that is exactly why I commented :)
2. Strictly speaking, Oracle can do it by itself (OR expansion / CONCATENATION), though not as well as a manual “union all”. So sometimes we don’t even have to rewrite the code, as we do with the FBI (see the sketch after this list).
3-4. Usually other indexes already exist and are needed for other queries, so often we only have to decide whether or not to add the new FBI. Also note that when you create an FBI, you create an additional hidden column in the table, and an update of any one of these 5 columns causes the FBI value to be recalculated. So we also have to analyze how frequently these columns are updated.
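(As a hedged illustration of point 2: Oracle’s built-in OR expansion can be requested with the USE_CONCAT hint. This sketch is not from the original comment, and whether the transformation is applied, and how well it performs, depends on the version, the available indexes and the statistics.)

-- A sketch only: ask the optimizer to expand the OR into a CONCATENATION plan.
select /*+ use_concat */ count(*)
from t
where date1 > sysdate - 1
   or date2 > sysdate - 1
   or date3 > sysdate - 1
   or date4 > sysdate - 1
   or date5 > sysdate - 1;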
btw, even if I had chosen the FBI, for convenience and transparency I’d create a virtual column with a logical name like “VIRT_LATEST_DATE” and an index on this column.
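(A minimal sketch of that suggestion, assuming Oracle 11g or later virtual column syntax; the column and index names are illustrative.)

-- A sketch only: expose the expression as a named virtual column ...
alter table t add (
  virt_latest_date date generated always as
    (greatest(date1, date2, date3, date4, date5)) virtual
);

-- ... index the virtual column ...
create index i$t$virt_latest_date on t(virt_latest_date);

-- ... and filter on the named column instead of repeating the expression.
select count(*) from t where virt_latest_date > sysdate - 1;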
Hi
One nice use of this technique is to create a virtual column (partition_date) defined as greatest(date1, date2, date3, date4, date5) and interval-partition the table by this virtual column. Check constraints of the form
date1 <= partition_date
enable the optimizer to utilize partition pruning even if partition_date is not mentioned in the query.
If the dateN columns can be null, then the current version of the optimizer seems to require mentioning the 'date1 <= partition_date' condition in the where clause if date1 is used for filtering rows. Also, nvl is required in the greatest function for nullable columns.
This is useful for tables which have many date columns that do not differ much from each other, all of which can be and are used in where clauses. Typically queries refer to 'new' data, so this technique prunes partitions efficiently.
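(A sketch of what that might look like; the table name, constraint names and the anchor partition boundary are illustrative, and whether interval partitioning accepts a virtual column as the key may depend on the Oracle version, so treat this as an illustration rather than a verified recipe.)

-- A sketch only: partition_date is a virtual column, the table is
-- interval-partitioned by it, and the check constraints give the optimizer
-- what it needs to prune partitions when a query filters on one of the
-- underlying date columns.
create table t_part (
  id    number,
  date1 date not null,
  date2 date not null,
  date3 date not null,
  date4 date not null,
  date5 date not null,
  partition_date date generated always as
    (greatest(date1, date2, date3, date4, date5)) virtual,
  constraint t_part_chk1 check (date1 <= partition_date),
  constraint t_part_chk2 check (date2 <= partition_date),
  constraint t_part_chk3 check (date3 <= partition_date),
  constraint t_part_chk4 check (date4 <= partition_date),
  constraint t_part_chk5 check (date5 <= partition_date)
)
partition by range (partition_date)
interval (numtoyminterval(1, 'MONTH'))
(partition p0 values less than (date '2015-01-01'));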
Hi Lasse,
thanks for your comment. A hidden virtual column is implicitly created when the FBI is created, but I agree that explicitly creating it could offer some advantages.
Using this virtual column as a partitioning key is not always an option (partitioning can be unavailable due to licensing restrictions, or the table may need a different partitioning key for some reason, e.g. reference partitioning to align partitions with a logically connected table). Besides, even if partitioning can be used in principle, it’s not something that can be done easily to an existing table with data (although various techniques like ALTER TABLE EXCHANGE PARTITION could simplify that task). Those are the reasons why I didn’t focus on partitioning here (well, that, and the desire not to make the post too long). Nevertheless, I agree that where appropriate, this could be a very good solution.
The most important thing, however, is to realize that getting rid of OR can be done by means other than OR expansion with UNION ALL, e.g. using logical equivalence and the rules of arithmetic as described in the example. Once you do that, there are many other techniques that you can combine this trick with to make it even more efficient.
Regarding NVLs: yes, if the columns are nullable then the greatest function would need to contain NVLs. I did mention that in the post.
Best regards,
Nikolay