“Snapshot too old” as a performance issue in disguise

11 Mar

There exists a fair amount of confusion around one of the most famous Oracle errors — ORA-01555 (“snapshot too old”). The difficulty is probably coming from the fact that ORA-01555 is related to internal workings of undo generation and usage in an Oracle database, which a rather complex. Because of that, when this error on their system, inexperienced DBAs and developers often either start looking at UNDO-related parameters, rollback segment sizes, etc., or start looking for any suspicious activity, which they later link to excessive undo “consumption” or “generation”.

In many cases, however, things are much simpler. ORA-01555 is typically just another name for a performance problem (I’m not talking about the LOB-related ORA-01555 here). Here is a typical scenario: change in SQL plan, a hardware problem, or something else resulting in performance degradation makes the statement take longer than usual to complete, and eventually there’s not enough UNDO to maintain read consistency. And all it takes to find the culprit, is look in the alert log file and look up the sql id next to the ORA-01555 message. And all it takes to confirm, is examine recent performance history for this statement (using AWR views or application logs) to make sure that indeed it took longer than usual.

Of course there is always a possibility of a more complex mechanism be behind this error, but on a stable database where parameters are not played with too eagerly, it doesn’t make a lot of sense to jump to undo retention parameters and rollback segment sizes immediately — if there is ever need for that, it could always be done later, after spending a couple of minutes on pursuing the other, much more likely possibility.

Recently, I had a few cases of ORA-01555, one of which was interesting because of a rather long causality chain involved. It was a test 10g database which was supposed to be production-like in most aspects, and some jobs were found failed with this error. As it turned   out eventually, here is what happened:

1) a custom stats job entered an infinite loop because of a bug in code

2) as a result, daily maintenance jobs that create partitions for some table and rebuild/compressed local indexes on them, started piling up

3) as a result of 2), some of such jobs were running at unusual times; because of DDL in these jobs Oracle couldn’t use some of the indexes, which resulted in a query in a job taking much longer than usual

4) eventually, there wasn’t enough undo retained on the system to maintain the read consistency, and the statement failed with ORA-01555, causing the entire job to fail.

Once one realizes that this is a performance issue, getting to the root cause is relatively straightforward (from DBA_HIST_SQL_PLAN it was clear that something was making an index periodically unavailable, and DBA_AUDIT_TRAIL combined with the application log told why). But if one started off with a wrong idea in mind, then it could’ve taken hours or even days to get back on the right track.

About these ads

3 Responses to ““Snapshot too old” as a performance issue in disguise”

  1. Yasir March 17, 2013 at 2:53 pm #

    I have seen quite a lot ora-01555 errors in recent times. Most of the time we run automatic undo advisor that advises how much undo retention we need based on the largest running query. after we set the undo retention to that value, the error disappears.

    • savvinov March 17, 2013 at 4:32 pm #

      Hi Yasir,

      undo retention should be properly configured before going live. Or when the application workload profile changes in such a way that existing undo settings are no longer adequate. Simply increasing UNDO retention every time there’s an ORA-01555 is not a very good solution. It is pretty much same as increasing application timeouts after a timeout occurs — sure, the symptom goes away, but the underlying problem (poor performance) is still there.

      • Yasir March 29, 2013 at 9:53 am #

        In the quest of resolving ora-01555 erros, we do advise our users to change the code or lists out them queries using high undo but change in code takes time and the immediate solution left is increase undo retention and undo tablespace.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 344 other followers

%d bloggers like this: