Disjunction

Disjunction (logical OR) is known to cause various performance problems, from extreme parse times (e.g. here) to sub-optimal plans. A common solution to such problems is getting rid of OR’s by “OR expansion” (i.e. rewrite via UNION ALL), although it doesn’t work in 100% cases. In this post, I will consider an example of an OR problem that can be solved differently.

Here is the setup:

create table t(
   id number,
   x number,
   date1 date not null,
   date2 date not null,
   date3 date not null,
   date4 date not null,
   date5 date  not null
);

insert into t
select level, 
       dbms_random.value, 
       sysdate - 1000*dbms_random.value, 
       sysdate - 1000*dbms_random.value, 
       sysdate - 1000*dbms_random.value, 
       sysdate - 1000*dbms_random.value, 
       sysdate - 1000*dbms_random.value
from dual
connect by level <= 1e6;

commit;

exec dbms_stats.gather_table_stats(user, 'T');

So we have a table with a bunch of date columns, and we want to be able to efficiently run queries like this one:

select count(*)
from t
where date1 > sysdate - 1 or
date2 > sysdate - 1 or
date3 > sysdate - 1 or
date4 > sysdate - 1 or
date5 > sysdate - 1;

What are the options here? One obvious solution is to do the full tablescan. But since we are only selecting a small percentage of rows, it would be very inefficient:

  COUNT(*)
----------
      4712

Execution Plan
----------------------------------------------------------
Plan hash value: 1842905362

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    40 | 15623  (69)| 00:00:05 |
|   1 |  SORT AGGREGATE    |      |     1 |    40 |            |          |
|*  2 |   TABLE ACCESS FULL| T    |  4737 |   185K| 15623  (69)| 00:00:05 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DATE3">SYSDATE@!-1 OR "DATE1">SYSDATE@!-1 OR
              "DATE4">SYSDATE@!-1 OR "DATE5">SYSDATE@!-1 OR "DATE2">SYSDATE@!-1)


Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
      20004  consistent gets
      19997  physical reads
          0  redo size
        527  bytes sent via SQL*Net to client
        524  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

That’s a lot of work to retrieve 0.5% of rows out of 1M row table (over 4 I/Os per each row counted… yeuch…).

What if we create a concatenated index on all date columns? Let’s try that.

create index i$t$dates1 on t(date1, date2, date3, date4, date5);
exec dbms_stats.gather_table_stats(user, 'T');

Execution Plan
----------------------------------------------------------
Plan hash value: 338327390

------------------------------------------------------------------------------------
| Id  | Operation             | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |            |     1 |    40 | 13990  (74)| 00:00:04 |
|   1 |  SORT AGGREGATE       |            |     1 |    40 |            |          |
|*  2 |   INDEX FAST FULL SCAN| I$T$DATES1 |  4723 |   184K| 13990  (74)| 00:00:04 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DATE3">SYSDATE@!-1 OR "DATE1">SYSDATE@!-1 OR
              "DATE4">SYSDATE@!-1 OR "DATE5">SYSDATE@!-1 OR "DATE2">SYSDATE@!-1)


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
      14787  consistent gets
          0  physical reads
          0  redo size
        527  bytes sent via SQL*Net to client
        524  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

We got an INDEX FAST FULL SCAN instead of a TABLE ACCESS FULL, which is an improvement, but not really a big one. In terms of logical I/O we only gained 25% or so.

What about less obvious options? What if we rewrite the query as:

select count(*)
from t
where greatest(date1, date2, date3, date4, date5) > sysdate - 1;

Note that I declared the date columns as NOT NULL to ensure that this rewrite won’t break the logical equivalence, but the same goal could also be achieved with NVL if the columns are in fact nullable.

Have we gained anything, except for making the query slightly more compact? Actually, we did, because now we can create a function-based index to help us:

create index i$t$dates2 on t(greatest(date1, date2, date3, date4, date5));
exec dbms_stats.gather_table_stats(user, 'T');

Now execution stats improve considerably:

Execution Plan
----------------------------------------------------------
Plan hash value: 2843130627

--------------------------------------------------------------------------------
| Id  | Operation         | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |            |     1 |     8 |    38  (14)| 00:00:01 |
|   1 |  SORT AGGREGATE   |            |     1 |     8 |            |          |
|*  2 |   INDEX RANGE SCAN| I$T$DATES2 |  5439 | 43512 |    38  (14)| 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access(GREATEST("DATE1","DATE2","DATE3","DATE4","DATE5")>SYSDATE@
              !-1)


Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
         31  consistent gets
          0  physical reads
          0  redo size
        527  bytes sent via SQL*Net to client
        524  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

Now that is real improvement: logical I/O is down by almost 3 orders of magnitude!

Note that the index itself is also smaller than the concatenated one, so we have the additional benefit of saving some space.

9 thoughts on “Disjunction”

  1. Забавно, Николай, у меня Ваш запрос вернул не 4712, а 4812 …
    Пользуясь случаем, хочу выразить благодарность за Ваш замечательный блог.
    Сергей
    Россия, Вологда.

    1. Hi Sergey, thanks for visiting my blog!

      It’s not really surprising that you got a slightly different number of rows here because there is a sysdate in the query (and if you’ll re-run the query now, you’ll see a yet different number…).

      I know it’s somewhat confusing but it would have taken me much longer to come up with a more deterministic example, so I decide that it didn’t really matter.

      Best regards,
      Nikolay

  2. Hi Nikolay!

    Sometimes it enough to use “union all” solutions, because even in your example it works quite good:
    we can create 5 indexes – one index per each column:

    In that case we haven’t FBI, we can still use these indexes for other queries and result not so bad:

    Regards,
    Sayan Malakshinov

    1. Sayan,

      thanks for your comment.

      1) I did mention that OR expansion is often a solution to OR problems, but I don’t think that it’s a very good one in this case
      2) just imagine how much uglier and more difficult to support the code would become with 4 UNION ALL’s in place. Please bear in mind that the code in my example just a simplified representation of the actual problem. The real-life prototype of it was several screens long. Now you’re basically — how would you feel if you were to support code like this?
      3) I suspect that updating 5 indexes with DML would be more expensive than updating 1 FBI
      4) I suspect that the 5 indexes would occupy more space than the FBI. I think even more than the table itself.

      So to sum up, while this might be a working solution, it’s far from being practical, let alone optimal.

      1. Nikolay,

        1. I noticed about “OR expansion”, because of that i commented exactly :)

        2. Strictly speaking oracle can do it itself, but not so good(or_expand+concatenation) as manual “union all”. So sometimes we even haven’t to rewrite code as we have to do with FBI.

        3-4. Usually other indexes already exists and they are needed for other queries, so often we have to compare just to add or not to add new FBI. Also note that when you create FBI, you create additional hidden column in the table. Also note that updates of anyone of these 5 columns will cause FBI value recalculating. So we have to also analyze frequency of updates of these columns.

  3. Hi

    One nice use of this technique is to create a virtual column (partition_date) based as greatest(date1, date2, date3, date4, date5) and interval partition table by this virtual column. Constraints defining
    date1 <= partition_date
    enable optimizer to utilize partitition pruning even if partitition_date is not mentioned.
    If dateN columns can be null, then currect version of optimizer seems to require
    mentioning in where clause 'date1<= partition_date' condition if date1 is used for filtering rows. Also nvl is required in greatest functio for nullable columns.

    This is useful for tables which have many date columns, which do not differ so much from each other and all of them can and are used in where clauses. Typically queries will refer to 'new' data, so this technique would prune partitions efficiently.

    1. Hi Lasse,

      thanks for your comment. A virtual column would be implicitly created by the optimizer when creating the FBI, but I agree that explicitly creating it could offer some advantages.

      Using this virtual column as a partitioning key is not always an option (partitioning can be not available due to licensing restrictions, or the table may need to have a different partitioning key for some reason, e.g. reference partitioning to align partitions with a logically connected table). Besides, even if partitioning can be used in principle, it’s not something that can be done easily to an existing table with data (although various techniques like ALTER TABLE EXCHANGE PARTITION could simplify that task). Those are the reasons why I didn’t focus (well, that, and the desire not to make the blog too large). Nevertheless I agree that where appropriate, this could be a very good solution.

      The most important thing, however, is to realize that getting rid of OR can be done by other means that OR exnapsion with UNION ALL — e.g. using logical equivalence and rules of arithmetic as described in the example. Once you do that, there are many other techniques that you can combine this trick with to make it even more efficient.

      Regarding NVL’s — yes, if the columns are NULLABLE then the greatest function would need to contain NVLs. I did mention that in my blog.

      Best regards,
      Nikolay

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s