Nullability, indexes and joins

A few days ago, I happened to participate in an interesting thread on an OTN forum. We weren’t able to answer all of the questions raised by the original poster, but the discussion stirred some thoughts that eventually led me to write this post.

One of the key things that make a database a database, and not a data dump, is integrity constraints. They make it possible to enforce restrictions on data very efficiently, and independently of the current application interface. But constraints also play a very important role in providing additional information about the data to the query optimizer. Failing to provide that information, or providing it incorrectly, can lead to serious performance issues.

Let’s consider an example.


drop table t;

create table t (x number, y number, z number);

create index i$t on t(x);

exec dbms_stats.set_table_stats(user, 'T', numblks=>1e5, numrows=>1e7);

explain plan for select x from t;

select * from table(dbms_xplan.display);

Plan hash value: 1601196873

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |    10M|   123M| 22033   (1)| 00:04:25 |
|   1 |  TABLE ACCESS FULL| T    |    10M|   123M| 22033   (1)| 00:04:25 |
--------------------------------------------------------------------------

So I created a simple table with 3 columns: X, Y and Z, created an index on X, and then ran a query selecting just X from the table. I didn’t bother to create any test data, but I used dbms_stats.set_table_stats to make the table appear big (10M rows packed into 100k blocks). Since an index “is just a skinnier version of the table”, as T. Kyte puts it, my query could be satisfied by the index alone. Still, I get a full table scan, even though my manipulations with the stats have made it very expensive (unlike the INDEX FAST FULL SCAN, as we will see shortly).

The reason is very simple: a single-column B-tree index cannot contain NULL values, so if X contained NULLs, an INDEX FAST FULL SCAN would produce wrong results. Let’s check this:


explain plan for select x from t where x is not null;

select * from table(dbms_xplan.display);

Plan hash value: 3417665161

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |   500K|  6347K|     7   (0)| 00:00:01 |
|*  1 |  INDEX FAST FULL SCAN| I$T  |   500K|  6347K|     7   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("X" IS NOT NULL)

As we can see, as soon as Oracle establishes that replacing the table with the index won’t result in data loss, it has no difficulty finding the right access path (which, as I said earlier, is indeed much cheaper than the FULL TABLE SCAN). Now let’s find a way to tell Oracle that such a substitution is safe by using a constraint:

ALTER TABLE T MODIFY X CONSTRAINT NN$T NOT NULL;

explain plan for select x from t;

select * from table(dbms_xplan.display);

Plan hash value: 3417665161

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |    10M|   123M|     7   (0)| 00:00:01 |
|   1 |  INDEX FAST FULL SCAN| I$T  |    10M|   123M|     7   (0)| 00:00:01 |
-----------------------------------------------------------------------------

As we can see, it works perfectly. A word of caution: a check constraint with the same meaning won’t work here, which is yet another reason to implement NOT NULL constraints the way they were designed, rather than via check constraints.
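For reference, the check-constraint variant that does not help here can be reproduced like this (a sketch against the same table; the constraint name CHK$T is made up, and the behavior is as observed on the version used in this post):

```sql
-- Remove the proper NOT NULL constraint and replace it with a
-- logically equivalent CHECK constraint (name is illustrative).
alter table t drop constraint nn$t;
alter table t add constraint chk$t check (x is not null);

-- With only the CHECK constraint in place, the optimizer falls
-- back to the full table scan for this query: it does not use
-- the check constraint to prove that X is non-nullable.
explain plan for select x from t;
select * from table(dbms_xplan.display);
```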

It is noteworthy that if you add WHERE x = x to the query, we also get the plan with the INDEX FAST FULL SCAN, i.e. the optimizer is smart enough to figure out that x = x is equivalent to x IS NOT NULL (since NULL = NULL evaluates to NULL, not TRUE). In a similar way, the optimizer is able to make use of join predicates.
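The x = x rewrite can be checked like this (same table as before, with no NOT NULL constraint on X):

```sql
-- The optimizer recognizes x = x as equivalent to X IS NOT NULL:
-- when X is NULL, NULL = NULL evaluates to NULL rather than TRUE,
-- so such rows cannot pass the predicate anyway.
explain plan for select x from t where x = x;
select * from table(dbms_xplan.display);
```

The resulting plan shows the INDEX FAST FULL SCAN with an "X" IS NOT NULL filter, just like the explicit IS NOT NULL query above.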

create table t2(w number);

exec dbms_stats.set_table_stats(user, 'T2', numblks=>1e5, numrows=>1e7);

explain plan for
select t.x
from t, t2
where t.x = t2.w;

select * from table(dbms_xplan.display);

Plan hash value: 859427532

--------------------------------------------------------------------------------------
| Id  | Operation             | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |      |    10M|   247M|       | 45934   (1)| 00:09:12 |
|*  1 |  HASH JOIN            |      |    10M|   247M|   238M| 45934   (1)| 00:09:12 |
|   2 |   INDEX FAST FULL SCAN| I$T  |    10M|   123M|       |     7   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL   | T2   |    10M|   123M|       | 22033   (1)| 00:04:25 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("T"."X"="T2"."W")

Things get more interesting when we turn to outer joins. In principle, if the table that needs to be accessed via an index fast full scan is on the optional side of the join (the one marked with “(+)”), this optimization is still possible. And that makes sense: if the table is on the optional side of an outer join, its columns appear in the output only when the join condition evaluates to TRUE, so NULLs in its join key don’t matter. However, things can easily get too complicated for the optimizer, resulting in a full table scan. This is especially likely with ANSI join syntax, queries using the WITH clause, rownum filtering, etc. So it is much more reliable to secure the correct plan with a NOT NULL constraint than to leave it up to the optimizer.
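A minimal sketch of the outer-join case, in Oracle join syntax, with T on the optional side (whether the optimizer actually picks the index fast full scan here depends on the version and on the exact shape of the query, as noted above):

```sql
-- T is on the optional side of the outer join (the "(+)" marker
-- is on its column): T's columns reach the output only when the
-- join condition evaluates to TRUE, so NULLs in T.X cannot
-- appear in the result, and the index on X is a safe substitute.
explain plan for
select t.x
from t2, t
where t2.w = t.x(+);

select * from table(dbms_xplan.display);
```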

So to sum up:

1) Constraints often help the optimizer find a better plan

2) Non-nullability should be implemented as a NOT NULL constraint, not as a check constraint (contrary to what many people believe, NOT NULL constraints can have names, so there is no reason to use check constraints for this)

3) If the optimizer fails to replace a full table scan with an index fast full scan where appropriate, it’s a good idea to check the nullability of the join key

Update: As was pointed out in the comments (thanks again for noticing this, Kash), another way of replacing a full table scan with an index fast full scan is to make the index “nullable”. As is well known, a single-column index cannot contain NULL values, but one can work around this limitation by creating a two-column index whose second column is a constant (e.g. an index on (x, 1) instead of an index on (x)). This way the index can cover NULL values of X, since there is another, non-null “column”. A nice trick!
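In SQL the workaround looks like this (index name and the drop of the earlier constraint are illustrative, to put the table back into its nullable state):

```sql
-- Make X nullable again so the trick has something to prove.
alter table t drop constraint nn$t;
drop index i$t;

-- A composite index whose trailing key is a constant: an index
-- entry is skipped only when ALL of its key columns are NULL,
-- so every row (including those with X IS NULL) gets an entry,
-- and the index fully covers X.
create index i$t2 on t(x, 1);

-- No NOT NULL constraint is needed for the index-only plan now:
explain plan for select x from t;
select * from table(dbms_xplan.display);
```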

2 thoughts on “Nullability, indexes and joins”

  1. What if the index is created as a composite index in which one column is a literal value and hence not null, such as idx1 on (x,2)?

    1. Hi Kash,

      yes, that’ll work (at least it worked on my 11.2.0.1) ! Thanks for pointing this out to me — never used that trick before.

      Best regards,
      Nikolay
