For many Oracle database patches there is an option to apply them “online”, i.e. without stopping the database and related services (listener, ASM etc.). This is very convenient when downtime is impossible or difficult to arrange. However, this convenience comes at a certain price, including some performance penalties. Such side effects of hot patching are not well understood, even by Oracle support engineers.
There is MOS note 761111.1 which, in addition to general information about online and offline patching, mentions the possibility of increased memory usage and delays in process startup, and gives a “strong recommendation” to replace online patches with their offline versions at the next instance shutdown. This is not sufficient, because:
- even though this note is referenced in online patch instructions, it’s presented as recommended supplementary reading, not as an absolute must-read
- the list of side effects is far from complete
- the note doesn’t state that some of the symptoms may show up not immediately, but only after the first database instance restart (provided that the online patch wasn’t replaced with an offline one).
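Given that last point, it is worth checking for installed online patches before any planned restart. A minimal sketch using OPatch (the exact output format and flags vary by OPatch version, so verify against your platform’s OPatch documentation):

```shell
# List all interim patches registered in the Oracle home inventory;
# online (hot) patches are flagged as such in the detailed output.
$ORACLE_HOME/OPatch/opatch lsinventory -detail

# Rough first-pass filter for hot/online patches only
# (the exact wording in the output differs between OPatch versions):
$ORACLE_HOME/OPatch/opatch lsinventory -detail | grep -i "online"
```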
I had to learn this the hard way. We were facing an annoying problem with spurious ORA-1400 errors on a 22.214.171.124 database, and after some struggling with Oracle support they finally admitted that it was a genuine problem and offered a patch. The patch instructions offered two ways of installing it, online and offline, and because it was more convenient for us, we went with the online one. A week later the server was rebooted, and starting Monday, some database code became slower. In particular, a critical batch process started to take up to twice as long as usual, primarily because of the “latch: row cache objects” wait event. The OEM performance page also showed high waits for CPU (not just high CPU usage, but actual waits for CPU), which, oddly enough, could not be confirmed from other sources.
Because of the one-week delay between the patch application and the appearance of the symptoms, it took a long time to establish the connection between the two events (Oracle support was engaged at a very high level, but that didn’t help much). Eventually we removed the patch and re-applied it offline, which resolved the problem.
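For the record, the remove-and-reapply procedure looked roughly like this. This is a sketch, not verbatim commands: the connect-string format, flags, and staging directory depend on your OPatch version and the patch README, so always follow the README:

```shell
# 1. Roll back the online patch while the instance is still running
#    (connect string format here is SID:user:password:node,
#    as used by online patching; verify with your OPatch version).
$ORACLE_HOME/OPatch/opatch rollback -id 7276499 \
    -connectString ORCL:sys:passwd:node1

# 2. Shut down the instance and apply the same fix as a regular
#    (offline) patch from the unzipped patch directory.
sqlplus / as sysdba <<EOF
shutdown immediate
EOF
cd /path/to/patch/7276499    # hypothetical staging directory
$ORACLE_HOME/OPatch/opatch apply

# 3. Restart the instance.
sqlplus / as sysdba <<EOF
startup
EOF
```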
The exact mechanism behind these symptoms remains unknown, as the service request raised with Oracle support was closed as “expected behavior” with a reference to an unpublished bug. I can only speculate that the primary symptom was the CPU waits, which caused the latches to be held longer, which, in turn, caused the “latch: row cache objects” waits. I can further assume that the CPU waits had to do with the way online patches are implemented via the oradebug interface. More questions remain unanswered, e.g. why the behavior changed only after the database restart, or why this particular latch, of all latches, was affected (any ideas about that are very much welcome).
To sum up:
– online patches can introduce unexpected side effects, including performance degradation
– these side effects can manifest themselves as latch contention, excessive CPU waits, or in other ways
– it is particularly dangerous to bounce the database after applying an online patch without first removing it and re-installing it offline
– unfortunately, Oracle doesn’t do a very good job of warning DBAs about the dangers of online patching (but they promised to work on it!).
5 thoughts on “Caution: hot patches”
Would you mind sharing the patch# which caused this problem?
I would like to try to reproduce it…
Thank you, xkglob
Of course I don’t mind: the patch number is 7276499. But I must warn you that it may be difficult to reproduce this problem. It took us several weeks to do so on a copy of the production database, and it took Oracle support over a month to reproduce it in-house (and they didn’t provide any details as to how they did it).
One of the reasons why it was so hard to reproduce the problem was that it didn’t affect all database activities equally. Globally, “latch: row cache objects” waits accounted for only 2-3% of DB time, but for some processes this percentage reached 30-40% or even higher.
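One quick way to see that skew, assuming you are licensed for the Diagnostics Pack, is to count ASH samples per module, since each V$ACTIVE_SESSION_HISTORY sample approximates one second of DB time. A sketch (view and column names are standard, but check them against your database version):

```shell
sqlplus -s / as sysdba <<'EOF'
-- Per-module share of time spent on "latch: row cache objects"
-- over the last hour; each ASH sample is roughly 1 second of DB time.
SELECT module,
       COUNT(*) AS total_samples,
       SUM(CASE WHEN event = 'latch: row cache objects'
                THEN 1 ELSE 0 END) AS latch_samples,
       ROUND(100 * SUM(CASE WHEN event = 'latch: row cache objects'
                            THEN 1 ELSE 0 END) / COUNT(*), 1) AS latch_pct
FROM   v$active_session_history
WHERE  sample_time > SYSDATE - 1/24
GROUP  BY module
ORDER  BY latch_pct DESC;
EOF
```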
I will try online patch 7276499 and generate some workload with Swingbench; I’ll get back to you if I can reproduce the latch issue. Thanks