A rather odd performance issue happened at work a few days ago. All of a sudden, one of the databases started to work very slowly, and a quick look in ASH data showed that it was spending over 70% of its time waiting on “row cache lock” (not to be confused with “latch: row cache lock”).
It was a test database (126.96.36.199) on an overloaded server with non-standard configuratioin, so my initial reaction was to bounce the database and hope that the symptoms would go away, but it didn’t work. So there was no other choice but to investigate the root cause. The next simplest thing to “turning it off and then back on” was finding the blocking session in ASH (the V$ACTIVE_SESSION_HISTORY view has columns BLOCKING_SESSION and BLOCKING_SESSION_SERIAL#). Most of the entries for this wait event had NULLs in these columns, but among those NULLs there were sid’s and serial#’s of two sessions. Unfortunately, both sessions were from background processes, MMON and SMON, so killing them obviously wasn’t a tempting choice. MMON is responsible for collecting AWR data and such, so the instance could probably survive if it gets killed, but SMON must always be running or the instace would crash.
So I looked at parameters of this event. The first parameter, P1, was always 3. According to the documentation, that parameter means the cache id, and the name of the cache can be found in V$ROWCACHE. It turned out to be DC_ROLLBACK_SEGMENTS. With this information at hand, I searched MOS for bugs with matching symptoms, but all matches found were for other Oracle versions. I took the system state dump trying to find additional information, but all I got was same session id’s that I saw in ASH. I tried a few other things, but to no avail.
A colleague suggested to re-create rollback segments. At first, I didn’t consider this approach promising, but since things I deemed more promising failed to work, I decided to try it as well. Of course it was simpler to just re-create the UNDO tablespace rather deal with individual rollback segments, so I created a new UNDO tablespace, updated UNDO_TABLESPACE parameter, and tried to drop the old one. The DROP command failed because there was still an active rollback segment. I bounced the database, and tried to do the same. The rollback segment was still active (and row cache lock events were still eating up most of database time). I used _offline_rollback_segments parameter and bounced the database again. The old UNDO tablespace still refused to drop, and row cache lock waits continued to rampage. I checked the rollback segments associated with the old UNDO tablespaces and their statuses, and found that the problem rollback segment was in NEEDS RECOVERY state. Then I tried to drop the rollback segment manually, and that worked, although only at the second attempt (the first attempt ended up in an ORA-30025 error)! The row cache lock waits disappeared as suddenly as they came, and a few minutes later the DROP command for the old UNDO tablespace finally completed.