Occam’s razor

This year was rich in interesting performance issues. This post is about one recent issue that was particularly intriguing because it was:

– intermittent

– not associated with any particular database activity

– most likely to appear at a certain time of day (between 2am and 4am database time), although with some variation

– following no discernible pattern in the days on which it appeared

– degrading performance via a whole bunch of various “bad” wait events rather than just one

– uncorrelated with any workload metrics.

Continue reading “Occam’s razor”

“4k bug” is not a bug

A couple of weeks back I received an update on my “4k bug” SR (slow multirow DML performance for small block sizes). As it turns out, the observed behavior is not a bug. It’s actually more interesting than that. It is an undocumented feature that controls the insert mode depending on the estimated row size (the sum of all column sizes). If the estimated row size exceeds the size of the block, then Oracle switches to row-by-row processing.

This can be illustrated with the example below (in a 4k tablespace):

select * from v$version;
  
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE	11.2.0.3.0	Production
TNS for Solaris: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production  

set autotrace on statistics

create table t4k(x number, padding varchar2(4000));

insert into t4k 
select level, rpad(' ', 100, ' ')
from dual
connect by level <= 10000;

Statistics
----------------------------------------------------------
          1  recursive calls
      11200  db block gets
        313  consistent gets
        109  physical reads
    3733140  redo size
        830  bytes sent via SQL*Net to client
        857  bytes received via SQL*Net from client
          3  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
      10000  rows processed


create table t4k2(x number, padding varchar2(2000));

insert into t4k2
select level, rpad(' ', 100, ' ')
from dual
connect by level <= 10000;


Statistics
----------------------------------------------------------
          1  recursive calls
       1255  db block gets
        616  consistent gets
         19  physical reads
    1317000  redo size
        838  bytes sent via SQL*Net to client
        858  bytes received via SQL*Net from client
          3  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
      10000  rows processed

As you can see, it suffices to decrease the declared column length so that the maximum row length falls below the block size: performance improves significantly (e.g. the redo generated drops by a factor of almost 3!).

This means that specifying large column sizes “just in case” can get you into trouble if they bring the estimated row size over the block size. Obviously, the smaller the block size, the greater the risk.
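A single table can be checked directly: the estimated row size is simply the sum of data_length over its columns in the dictionary (the table name here is just a placeholder):

  select sum(data_length) est_row_length
  from user_tab_columns
  where table_name = 'T4K';

If the sum reaches the block size of the table’s tablespace, the table is at risk of the slow row-by-row insert mode.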

The query below can identify the tables that will have slow bulk DML because of the high maximum row length:

  select c.owner, 
         c.table_name, 
         sum(data_length) est_row_length, 
         ts.block_size
  from dba_tab_columns c,
       (select owner, table_name, tablespace_name 
        from dba_tables 
        union select table_owner owner, table_name, tablespace_name 
        from dba_tab_partitions
       ) t,
       dba_tablespaces ts
  where c.owner = t.owner
  and c.table_name = t.table_name
  and ts.tablespace_name = t.tablespace_name
  group by ts.block_size, c.owner, c.table_name
  having sum(data_length)>=ts.block_size
  order by 1, 2

For the tables returned by this query, bulk DML will internally be performed row by row, decreasing performance and increasing redo generation. If you want to change this, you’ll need to either shrink the column sizes or move the table to a tablespace with a larger block size.

Making sense out of optimizer trace files

Sometimes the optimizer makes really odd decisions. We can use optimizer (“event 10053”) tracing to obtain clues as to why such decisions were taken, but it’s not simple. First, such trace files tend to be big. Second, they don’t always contain the necessary information. There is a simple trick that can be useful in such cases.

First, I use the “optimizer_features_enable” hint to try and find the Oracle version where the logic changes. Of course, I don’t need to gather optimizer trace files for that purpose: it suffices to look at the explain plan, like this:

explain plan for
select /*+ optimizer_features_enable('x.x.x.x') */ ...

select * from table(dbms_xplan.display);

changing the version one by one.
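As an aside, the list of accepted version values doesn’t have to be typed from memory: the v$parameter_valid_values view can list them (a sketch, assuming 11g view and column names):

  select ordinal, value
  from v$parameter_valid_values
  where name = 'optimizer_features_enable'
  order by ordinal;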

If that odd behavior is present in all versions, we’re out of luck and the method won’t work. But if the behavior changes, then I take two optimizer trace files for the versions immediately before and after the change, and run a diff (e.g. using WinMerge or a similar tool) on the sections that list parameter values:

[Screenshot: side-by-side diff of the optimizer parameter sections of the two trace files]

Then I go through the list of parameters that have different values and plug them into the query text with the opt_param hint, one by one, to see if any of them controls the behavior in question:

explain plan for
select /*+ opt_param('_optimizer_transitivity_retain', 'false') */ ...


It doesn’t work 100% of the time, but it has a decent shot at working: weird optimizer behavior often comes from some advanced feature introduced at some point, and when such features are added, they normally come with a parameter (as a rule, a hidden one) that allows you to switch them off.

But even if that doesn’t work, there is something else you can do: look further down in the optimizer trace file at the list of bugfixes, and go through them one by one, searching MOS to see whether any bug’s symptoms match your case. But since the list of bugfixes is quite long, I am too lazy to ever do that myself.
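By the way, the same bugfix list can also be queried from inside the database instead of being read out of the trace file (a sketch, assuming 11g view and column names):

  select bugno, value, optimizer_feature_enable, description
  from v$system_fix_control
  order by optimizer_feature_enable, bugno;

A candidate fix can then be toggled for testing with alter session set "_fix_control"='&lt;bugno&gt;:OFF'; (a hidden parameter, so use it for diagnostics only).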

Hope that helps someone.

Transactional integrity

In the database world (especially among database developers) a commit is often viewed as a sort of virtual “save” button, i.e. something you need to press to make your changes permanent. This is one of the reasons why developers are often happy to commit work as soon as they get a chance (even if it’s not properly finished), so that they won’t lose their changes in case of an error. Unfortunately, they don’t always think the whole thing through: restarting a process interrupted halfway may be much more difficult than redoing everything from scratch.

Continue reading “Transactional integrity”