# Histograms for strongly skewed columns

On a recent OTN thread, I learned a nice trick by J. Lewis that allows to circumvent certain problems with histograms.

Histograms were designed to solve the problem of estimating cardinality for skewed columns (i.e. where some values occur much more frequently than the others). For columns with low number of distinct values (NDV) Oracle collects a frequency histogram, which can be thought of as a set of two one-dimensional arrays:  one containing all possible values, the other containing their frequency (i.e. how many rows have this value). However, if sample size is small, then Oracle can miss rare values, and they won’t be reflected in the histogram. As a result, the cardinality estimates for those values will be wrong (depending on version Oracle will either set it to either 1 or to half of the frequency for the rarest value found).  A detailed explanation of the issues with examples can be found in blog posts by J. Lewis and R. Geist.

The trick is to create a function-based index that hides popular values, thus making sure that rare values are accurately represented in a histogram. Let’s consider an example: a table T with just one column X, which has two possible values, 1 (1M occurences) and 2 (10 occurences).

```
create table t as
with gen as (select 1 from dual connect by level<=1e3)
select 1 x from gen g1, gen g2;

insert into t select 2 from dual connect by level<=10;

commit;

create index i\$t\$x on t(x);

exec dbms_stats.gather_table_stats(null, 'T', method_opt=>'for all columns size 254');</pre>
explain plan for select * from t where x=2;

select * from table(dbms_xplan.display());

Plan hash value: 1601196873

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   500K|  1464K|   448   (3)| 00:00:06 |
|*  1 |  TABLE ACCESS FULL| T    |   500K|  1464K|   448   (3)| 00:00:06 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("X"=2)```

As we can see, we get a wrong cardinality with cardinality off by 5 orders of magnitude.

Now let’s create a function-based index that would “hide” the popular value:

```create index i\$t\$x_rare on t(case(x) when 2 then x end);

exec dbms_stats.gather_table_stats('SCOTT', 'T', method_opt=>'for all columns size 254');

```

and modify the query to use statistics on the virtual column created by the FBI

```explain plan for select * from t where case(x) when 2 then x end =2;

select * from table(dbms_xplan.display());

Plan hash value: 1371096493

------------------------------------------------------------------------------------------
| Id  | Operation                   | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |            |    10 |    40 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T          |    10 |    40 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | I\$T\$X_RARE |    10 |       |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(CASE "X" WHEN 2 THEN "X" END =2)

```

Now we are getting a better plan and correct cardinalities.

Of course it is a big disadvantage that we need to rewrite the query to make use of the virtual column, so in some cases it may be a better solution to use another trick suggested by J. Lewis: use “hand-made” histograms that provide a more realistic distribution. Of course, it is also always possible to increase the sample size, making sure that rare values aren’t missed, if the data size is not so big that larger sample size would make calculating histograms prohibitively expensive.

It should be also possible to the virtual column approach for dealing with correlated predicates.

## 5 thoughts on “Histograms for strongly skewed columns”

1. Nice follow up Nickolay , but i would suggest you mention oracle version too, as i’m thinking how about setting dbms_stats.auto_sample_size in 11g gather accurate samples

1. Hi Ranjit,

I suspect that despite certain improvements, auto_sample_size in 11g still doesn’t provide a 100% accurate histogram in all cases. I’m hoping to do a little research on this and make a post about it.

Best regards,
Nikolay

2. Ranjit,

In 11g the “approximate NDV code will allow Oracle to get an accurate number of distinct values for a column, but it will still take a (possibly small) sample when calculating the histogram. I don’t think Oracle will increase the sample until the number of values seen in the histogram matches the number seen by the NDV code (I may be wrong, of course, I’ll have to test the hypothesis at some point.)

My keyboard is acting strangely, apologies if the above comment comes out with lots of strange spellings and spare words.

1. Hi Jonathan,

thanks for your comment, it’s an honor to see you in my blog.

If I remember Randolf’s post correctly, he was talking about the possibility of the number frequency histogram buckets being different from NDV in 11g (in this case, cardinality decay wasn’t applied to values outside the (min, max) range). I interpreted this in sense that the new algorithm doesn’t increase the sample until the number of values in a histogram matches the number seen by the NDV code (otherwise the two would have always been the same).

I must admin though, that I didn’t test this myself.

Will try tomorrow when I get a chance.

Best regards,
Nikolay

2. Hi Jonathan,

I tested your hypothesis — as you suspected, in 11g Oracle does not increase the histogram sample size to match the number of found distinct values to previously calculated NDV.

Here is the script I ran on my Oracle 11.2.0.1:

```select * from v\$version;

drop table t;

create table t as
with gen as (select 1 from dual connect by level<=1e3)
select 1 x from gen g1, gen g2, gen g3
where rownum<=1e7;

insert into t select 2 from dual;

commit;

exec dbms_stats.gather_table_stats(null, 'T', method_opt=>'for all columns size 254');

select table_name, column_name, num_distinct, num_buckets from user_tab_col_statistics where table_name = 'T';
T	X	2	1
```

In the end we get NDV=2, but the histogram only has 1 bucket.