There is a fair amount of confusion around one of the most famous Oracle errors, ORA-01555 (“snapshot too old”). The difficulty probably comes from the fact that ORA-01555 is related to the internal workings of undo generation and usage in an Oracle database, which are rather complex. Because of that, when this error appears on their system, inexperienced DBAs and developers often either start looking at UNDO-related parameters, rollback segment sizes, etc., or start looking for any suspicious activity, which they then link to excessive undo “consumption” or “generation”.
In many cases, however, things are much simpler. ORA-01555 is typically just another name for a performance problem (I’m not talking about the LOB-related ORA-01555 here). Here is a typical scenario: a change in the SQL plan, a hardware problem, or something else causing performance degradation makes a statement take longer than usual to complete, and eventually there isn’t enough UNDO to maintain read consistency. All it takes to find the culprit is to open the alert log file and look up the SQL ID next to the ORA-01555 message. And all it takes to confirm the diagnosis is to examine recent performance history for that statement (using AWR views or application logs) and verify that it indeed took longer than usual.
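To illustrate the first step, here is a minimal sketch of pulling the SQL ID out of the alert log. It assumes the typical 10g-style message format, “ORA-01555 caused by SQL statement below (SQL ID: …, Query Duration=… sec, SCN: …)”; the sample log excerpt, sql_id, and timestamps below are invented for illustration:

```python
import re

# Invented alert.log excerpt for illustration; the message format follows the
# typical 10g-style "ORA-01555 caused by SQL statement below (SQL ID: ...)" line.
ALERT_LOG_EXCERPT = """\
Tue Mar 02 04:13:57 2010
ORA-01555 caused by SQL statement below (SQL ID: 7gtztzv329wg0, Query Duration=5815 sec, SCN: 0x0000.12345678):
SELECT /* long-running report */ ...
"""

# Capture the sql_id and the query duration from the ORA-01555 message.
PATTERN = re.compile(
    r"ORA-01555 caused by SQL statement below "
    r"\(SQL ID: (?P<sql_id>\w+), Query Duration=(?P<duration>\d+) sec"
)

def find_ora_1555_culprits(log_text):
    """Return (sql_id, duration_in_seconds) for each ORA-01555 message found."""
    return [(m.group("sql_id"), int(m.group("duration")))
            for m in PATTERN.finditer(log_text)]

print(find_ora_1555_culprits(ALERT_LOG_EXCERPT))
# -> [('7gtztzv329wg0', 5815)]
```

With the sql_id in hand, the performance history check is a simple lookup against the AWR views (or your application logs) for that statement.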
Of course, there is always the possibility of a more complex mechanism being behind this error, but on a stable database where parameters aren’t played with too eagerly, it doesn’t make much sense to jump to undo retention parameters and rollback segment sizes immediately. If there is ever a need for that, it can always be done later, after spending a couple of minutes pursuing the other, much more likely explanation.
Recently, I had a few cases of ORA-01555, one of which was interesting because of the rather long causality chain involved. It was a test 10g database that was supposed to be production-like in most aspects, and some jobs were found to have failed with this error. As it turned out, here is what happened:
1) a custom stats job entered an infinite loop because of a bug in code
2) as a result, daily maintenance jobs that create partitions for some tables and rebuild/compress local indexes on them started piling up
3) as a result of 2), some of these jobs were running at unusual times; because of the DDL in these jobs, Oracle couldn’t use some of the indexes, which made a query in one of the jobs take much longer than usual
4) eventually, there wasn’t enough undo retained on the system to maintain read consistency, and the statement failed with ORA-01555, causing the entire job to fail.
Once one realizes that this is a performance issue, getting to the root cause is relatively straightforward (DBA_HIST_SQL_PLAN made it clear that something was periodically making an index unavailable, and DBA_AUDIT_TRAIL combined with the application log explained why). But if one started off with the wrong idea in mind, it could have taken hours or even days to get back on the right track.
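The plan-change part of that check can be sketched as follows. This assumes you have already pulled (snap_id, plan_hash_value) pairs for the suspect sql_id from the AWR history (e.g. DBA_HIST_SQLSTAT); the snapshot IDs and plan hash values below are invented for illustration:

```python
# Each tuple is (snap_id, plan_hash_value) for one sql_id, e.g. as pulled
# from DBA_HIST_SQLSTAT; the values here are invented for illustration.
history = [
    (1001, 3724264953),
    (1002, 3724264953),
    (1003, 1820214211),  # plan flipped -- an index no longer used?
    (1004, 3724264953),
    (1005, 1820214211),  # flipped again: something is periodic
]

def plan_changes(history):
    """Return (snap_id, old_plan, new_plan) for every snapshot where the
    plan_hash_value differs from the previous snapshot's."""
    return [(snap, prev_plan, plan)
            for (_, prev_plan), (snap, plan) in zip(history, history[1:])
            if plan != prev_plan]

for snap, old, new in plan_changes(history):
    print(f"snap {snap}: plan changed {old} -> {new}")
```

A plan that keeps flipping back and forth at regular intervals, as in this made-up history, is exactly the kind of pattern that points at periodic DDL (index rebuilds) rather than at undo configuration.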