Disjunction (logical OR) is known to cause various performance problems, from extreme parse times (e.g. here) to sub-optimal plans. A common solution to such problems is getting rid of ORs via "OR expansion" (i.e. a rewrite using UNION ALL), although it doesn't work in 100% of cases. In this post, I will consider an example of an OR problem that can be solved differently.
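For reference, the rewrite itself looks like this (a hypothetical table and predicates, just to illustrate the transformation, not a query from the case below):

-- Original query: an OR between predicates on different columns
-- often prevents efficient index use.
select * from orders
where customer_id = :cust
   or order_date >= :dt;

-- "OR expansion" via UNION ALL: each branch can use its own index.
-- LNNVL in the second branch filters out rows already returned by the
-- first branch, so the same row is not counted twice.
select * from orders
where customer_id = :cust
union all
select * from orders
where order_date >= :dt
  and lnnvl(customer_id = :cust);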
In my previous post I mentioned method R as probably the most efficient approach to SQL optimization. However, it is important to focus on correct metrics for it to work correctly.
Consider this example (once again, the query is still running, so the only reliable diagnostic tool at our disposal is SQL real-time monitor):
First principles, Clarice. Simplicity. Read Marcus Aurelius. Of each particular thing ask: what is it in itself? What is its nature?
What do you do when you get a call about a query that keeps running without completing, where the query text itself looks very simple:
select col1, col2, ...
from someview
where colA = '12345-A'
  and colB in ('B')
order by colN, colM
and the SQL plan monitor report looks like this:
Year 2014 was a successful one for me. I was blessed with quite a few interesting performance issues which allowed me to expand my horizons. One topic I'd like to mention in particular is redo writing. I spent a few weeks optimizing a batch load process, and since redo writing was a serious bottleneck, I had to spend some time understanding the internals of the process. This resulted in a series of posts on log writing that received very favorable feedback from several top experts in the field, such as Jonathan Lewis, Tanel Poder, Kevin Closson, Doug Burns, Kyle Hailey and many others. Another interesting topic that I came across in 2014 is DML internals, in particular the factors that determine whether a DML statement will be processed row-by-row or in bulk. That led to the discovery of undocumented behavior which, as far as I can tell, was not known prior to that.
All that allowed my blog audience to grow significantly. I have crossed the 100,000 visits mark in 2014, and judging by current visit statistics, I might be able to double that number at some point in 2015. This is important to me because along with other metrics (retweets/reposts, comments, subscribers etc.) it shows me that I’m on the right track. I was also honored to be added to OakTable’s “good Oracle blogs to follow” list.
One of the frustrations in 2014 was that some of the good stuff in this blog remains unknown to the broader audience because of relatively low search engine rankings (ironically, some of my older posts that I myself don't find particularly interesting are doing much better in that area). If anyone who is reading this blog knows how this can be addressed (without spamming, of course!), I would very much appreciate your advice.
Anyways, I’d like to thank all the readers of this blog and wish you all the best in the new year!
This year was rich in interesting performance issues. This post is about one recent issue that was particularly intriguing because it was:
– not associated with any particular database activity
– tended to appear at a certain time of day (between 2am and 4am database time), with some variations
– showed no discernible pattern in the days when it appeared
– degraded performance via a whole bunch of various "bad" wait events rather than just one
– showed no correlation with the workload metrics.
A couple of weeks back I received an update on my "4k bug" SR (slow multirow DML performance for small block sizes). As it turns out, the observed behavior is not a bug. It's actually more interesting than that. It is an undocumented feature that controls the insert mode depending on the estimated row size (the sum of all column sizes). If the estimated row size exceeds the size of the block, Oracle switches to row-by-row processing.
This can be illustrated with the example below (in a 4k tablespace):
set autotrace on statonly

create table t4k(x number, padding varchar2(4000));

insert into t4k
select level, rpad(' ', 100, ' ')
from dual
connect by level <= 10000;

Statistics
----------------------------------------------------------
          1  recursive calls
      11200  db block gets
        313  consistent gets
        109  physical reads
    3733140  redo size
        830  bytes sent via SQL*Net to client
        857  bytes received via SQL*Net from client
          3  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
      10000  rows processed

create table t4k2(x number, padding varchar2(2000));

insert into t4k2
select level, rpad(' ', 100, ' ')
from dual
connect by level <= 10000;

Statistics
----------------------------------------------------------
          1  recursive calls
       1255  db block gets
        616  consistent gets
         19  physical reads
    1317000  redo size
        838  bytes sent via SQL*Net to client
        858  bytes received via SQL*Net from client
          3  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
      10000  rows processed
As you can see, it suffices to decrease the column length so that the maximum row length becomes less than the block size to significantly improve performance (e.g. the redo generated drops by a factor of almost 3!).
This means that specifying a large column size "just in case" can get you in trouble if it brings the total column size over the block size. Obviously, the smaller the block size, the larger the risk.
The query below can identify the tables that will have slow bulk DML because of a high maximum row length:
select c.owner, c.table_name, sum(data_length) est_row_length, ts.block_size
from dba_tab_columns c,
     (select owner, table_name, tablespace_name from dba_tables
      union
      select table_owner owner, table_name, tablespace_name from dba_tab_partitions
     ) t,
     dba_tablespaces ts
where c.owner = t.owner
  and c.table_name = t.table_name
  and ts.tablespace_name = t.tablespace_name
group by ts.block_size, c.owner, c.table_name
having sum(data_length) >= ts.block_size
order by 1, 2
For the tables returned by this query, bulk DML will be internally performed row-by-row, decreasing performance and increasing redo generation. If you want to change this, you’ll need to either shrink column size, or migrate the table to a tablespace with a larger block size.
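For example (using the table from the demo above; a sketch to be validated against your existing data and tablespaces, since the names here are from the test case), both remedies are simple DDL:

-- Reduce the declared length so the estimated row size fits within the
-- 4k block (only possible if no stored value exceeds the new limit).
alter table t4k modify (padding varchar2(2000));

-- Alternatively, move the table to a tablespace with a larger block size
-- (ts_8k is a hypothetical tablespace created with an 8k block size).
alter table t4k move tablespace ts_8k;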
Sometimes the optimizer makes really odd decisions. We can use optimizer ("event 10053") tracing to obtain clues as to why such decisions were taken, but it's not simple. First, such trace files tend to be big. Second, they don't always contain the necessary information. There is a simple trick that can be useful in such cases.
First, I use the optimizer_features_enable hint to try and find the Oracle version where the logic changes. Of course, I don't need to gather optimizer trace files for that purpose: it suffices to look at the explain plan like this:
explain plan for
select /*+ optimizer_features_enable('x.x.x.x') */ ...;

select * from table(dbms_xplan.display);
changing the version one by one.
If that odd behavior is present in all versions, we're out of luck and the method won't work. But if the behavior changes, then I take two optimizer trace files for the versions immediately before and after the change, and run a diff (e.g. using WinMerge or a similar tool) on the sections that list parameter values:
Then I go through the list of parameters that have different values, and plug them into the query text with the opt_param hint one by one to see if any of them controls the behavior in question:
explain plan for select /*+ opt_param('_optimizer_transitivity_retain', 'false') */ ...
It doesn't work 100% of the time, but in general it has a decent shot at working, because weird optimizer behavior often comes from some advanced feature introduced at some point, and when such features are added, they normally come with a parameter (as a rule, a hidden one) that allows you to switch them off.
But if even that doesn't work, there is something else you can do: look further down in the optimizer trace file, at the list of bugfixes, and go through them one by one, searching MOS to see if the bug symptoms are similar to your case. But since the list of bugfixes is quite long, I am too lazy to ever do that myself.
Hope that helps someone.
In the database world (especially among the database developers) a commit is often viewed as some sort of a virtual “save” button — i.e. it’s something that you need to do to make your changes permanent. This is one of the reasons why developers are often happy to commit work as soon as they get a chance (even if it’s not properly finished), so that they wouldn’t lose their changes in case of an error. Unfortunately, they don’t always think the whole thing through — restarting a process interrupted half-way may be much more difficult than re-doing everything from scratch.
Let's consider a hypothetical scenario. Suppose you have a process A that you want to run faster. This process doesn't commit (much), so it doesn't wait on log file sync. However, there is another multi-threaded process, B, that commits very frequently and spends a lot of time on "log file sync". You don't care about process B; your only goal is to make A go faster. After exhausting your tuning arsenal (to no avail), you go to a production DBA. He looks at the AWR report and says: "You're committing too much. Look how much time the system spends on log file sync." You tell him that process A doesn't commit much and doesn't suffer from log file sync waits, but the DBA insists: "Even so, reducing commits would improve the database's performance health in general, and by doing so it would benefit your process as well. Besides, getting rid of all that log file sync noise would help us see the problem with process A more clearly." You are convinced, and after spending some time with the code, you find a bunch of unnecessary commits inside loops. You remove them. You reduce the commit rate per second by several orders of magnitude! Your database is much healthier now! And your process A will now run... almost four times slower than it did before.
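The kind of change described above typically looks like this (a schematic PL/SQL sketch with made-up table names, not the actual code of process B):

begin
  for r in (select order_id from orders_staging) loop
    update orders
       set status = 'PROCESSED'
     where order_id = r.order_id;
    -- commit;  -- removed: committing inside the loop forces a log file
    --             sync wait on every single iteration
  end loop;
  commit;  -- a single commit after the loop is enough to make the work permanent
end;
/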
If the end of this story surprises you, it really shouldn't. Let me show you AWR excerpts that simulate this situation:
Another piece of good news — Oracle has opened a bug for yet another anomaly I reported earlier in my blog: row-by-row processing of bulk DML when the block size of the target table is less than the default 8k. So it's now officially bug 20039770 – "DML SLOW WITH 4K BLOCK SIZE VS 8K BLOCK SIZE". Their bug description seems a bit off (because 2K shows the same behavior as 4K, and 16/32K the same as 8K, so it's not really a case of "4K VS 8K"), but I'm sure they'll update it accordingly in the course of the investigation by the development team. Unlike the log parallelism bug, this one is not yet open to the public (maybe because they don't have anything to put into the bug note yet, not even a workaround); I'll post an update when that changes.
If you ever work with small block sizes then I highly recommend you familiarize yourself with this bug, because its impact on DML performance is quite big (I observed a 3x-7x effect, but it could be even larger in a different setup). It makes rewriting row-by-row logic via bulk DML pretty much useless, so if I had known about it earlier, it would have saved me a couple of weeks!
I will post another blog when Oracle concludes their investigation.