Excessive commits?

Let’s consider a hypothetical scenario. Suppose you have a process A that you want to run faster. This process doesn’t commit (much), so it doesn’t wait on log file sync. However, there is another multi-threaded process, B, that commits very frequently and spends a lot of time on “log file sync”. You don’t care about process B; your only goal is to make A go faster. After exhausting your tuning arsenal (to no avail), you go to a production DBA. He looks at the AWR report and says: “You’re committing too much. Look how much time the system spends on log file sync.” You tell him that process A doesn’t commit much and doesn’t suffer from log file sync waits, but the DBA insists: “Even so, reducing commits would improve the database performance health in general, and by doing so it would benefit your process as well. Besides, getting rid of all that log file sync noise would help us see the problem with process A more clearly.” You are convinced, and after spending some time with the code, you find a bunch of unnecessary commits inside loops. You remove them. You reduce the commit rate per second by several orders of magnitude! Your database is much healthier now! And your process A will now run… almost four times slower than it did before.
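
For the record, the kind of change described above might look roughly like this in PL/SQL (the table and column names are made up for illustration):

-- before: a commit inside the loop, i.e. one commit per row
begin
  for r in (select id, payload from staging_rows) loop
    insert into target_table (id, payload) values (r.id, r.payload);
    commit;        -- the "unnecessary" per-row commit
  end loop;
end;
/

-- after: a single commit once the whole unit of work is done
begin
  for r in (select id, payload from staging_rows) loop
    insert into target_table (id, payload) values (r.id, r.payload);
  end loop;
  commit;          -- one commit per logical transaction
end;
/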

If the end of this story surprises you, it really shouldn’t. Let me show you AWR excerpts that simulate this situation:

1) Before “fixing” the commit “problem”

Load Profile              Per Second    Per Transaction   Per Exec   Per Call
~~~~~~~~~~~~         ---------------    --------------- ---------- ----------
      DB Time(s):               12.6                0.0       0.00       8.86
       DB CPU(s):                2.2                0.0       0.00       1.53
       Redo size:       31,127,512.1            2,555.5
   Logical reads:           76,328.2                6.3
   Block changes:           85,515.9                7.0
  Physical reads:               12.8                0.0
 Physical writes:               19.5                0.0
      User calls:                1.4                0.0
          Parses:              127.5                0.0
     Hard parses:                2.1                0.0
W/A MB processed:                2.7                0.0
          Logons:                0.7                0.0
        Executes:           21,809.0                1.8
       Rollbacks:                0.0                0.0
    Transactions:           12,180.4

ASH wait profile for process A:

ON CPU 100%

2) After “fixing” the commit “problem”

Load Profile              Per Second    Per Transaction   Per Exec   Per Call
~~~~~~~~~~~~         ---------------    --------------- ---------- ----------
      DB Time(s):               59.7               59.0       0.00      51.21
       DB CPU(s):                2.1                2.1       0.00       1.78
       Redo size:       30,199,251.1       29,844,727.2
   Logical reads:           50,041.5           49,454.1
   Block changes:           63,557.1           62,811.0
  Physical reads:               11.2               11.0
 Physical writes:            2,461.8            2,432.9
      User calls:                1.2                1.2
          Parses:               73.3               72.4
     Hard parses:                1.0                1.0
W/A MB processed:                0.4                0.4
          Logons:                0.5                0.5
        Executes:           23,029.3           22,758.9
       Rollbacks:                0.0                0.0
    Transactions:                1.0

ASH wait profile for process A:

log buffer space              54.6
ON CPU                        18.2
latch: redo copy               9.1
log file switch completion     9.1
free buffer waits              9.1
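
A wait profile like the two above can be approximated with a query against ASH; here is a rough sketch, assuming the process of interest can be identified by its MODULE setting (the module name below is a placeholder):

-- rough sketch: percentage of ASH samples per wait event (or CPU) over the last hour
select nvl(event, 'ON CPU') as event,
       round(100 * count(*) / sum(count(*)) over (), 1) as pct
from   v$active_session_history
where  sample_time > sysdate - 1/24     -- last hour of samples
and    module = 'PROCESS_A'             -- placeholder: however process A identifies itself
group  by nvl(event, 'ON CPU')
order  by pct desc;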

Hopefully, it should be clear at this point what exactly happened here: the activity of interest (“A”) didn’t commit much, but it did generate redo. When commits in “B” were reduced, the log file sync problem (which only affects frequent committers) turned into a log buffer space problem (which affects all redo-generating sessions, whether they commit or not).

The hypothetical DBA made two important mistakes here:
– trying to improve performance “in general” instead of addressing the problem at hand
– blaming excessive commits for the log file sync issue.

In reality, the root cause of the high log file sync waits was that the database was producing too much redo (or, depending on how you look at it, the redo filesystem was processing redo too slowly). The redo filesystem was operating at the limit of its capacity, so “resolving” the log file sync issue inevitably caused the activity of interest to go slower.
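
If you want to check the redo generation rate for yourself, it can be read directly from AWR; here is a rough sketch (the snapshot ids and the 600-second interval length are placeholders):

-- rough sketch: average redo generation rate (MB/s) between two AWR snapshots
select round((e.value - b.value) / 1024 / 1024 / 600, 1) as redo_mb_per_sec
from   dba_hist_sysstat b,
       dba_hist_sysstat e
where  b.stat_name = 'redo size'
and    e.stat_name = 'redo size'
and    b.snap_id   = 100                -- placeholder: begin snapshot
and    e.snap_id   = 101                -- placeholder: end snapshot, 600 seconds later
and    b.dbid      = e.dbid
and    b.instance_number = e.instance_number;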

Of course this example doesn’t mean that excessive commits are good — they’re not, and not just because they hurt the performance of the committing process, but more importantly because they are bad for the database’s transactional integrity. But in some cases existing bottlenecks happen to play a positive role by acting as a sort of “resource manager”: they can prevent low-priority activity from consuming scarce resources that are also needed by higher-priority activities. So when you’re about to reduce the number of commits in your application to improve performance, make sure that you understand what you are doing and why, which process would benefit from this change the most, and which processes may pay for this benefit.

9 thoughts on “Excessive commits?”

  1. Thanks for posting. This scenario is rather frequent in poorly designed DB applications.
    However, the solution exists and is quite simple: temporary tables.
    Process A should write to temporary tables (which generate no redo) and commit only when ready. Committing very slowly when another process is committing frequently would also cause the “snapshot too old” error, which is quite nasty.
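
    For illustration, the suggested approach could look roughly like this (all names are made up):

    -- stage intermediate rows in a global temporary table instead of a permanent one
    create global temporary table stage_gtt (
      id       number,
      payload  varchar2(4000)
    ) on commit preserve rows;

    -- process A writes its intermediate results here...
    insert into stage_gtt (id, payload) select id, payload from source_rows;

    -- ...and touches the permanent table (and commits) only when the work is complete
    insert into target_table (id, payload) select id, payload from stage_gtt;
    commit;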

    1. Hi Cristiano,

      thanks for your comment. I’m afraid you got sidetracked by the symptoms of the issue and missed the point that I was trying to make. It’s about the weird notion many DBAs seem to have that improving “database performance health in general” will make everything in the database go faster. In most cases, this is not true. Fixing a bottleneck (log file sync in this case) means making some process go faster, which in turn means consuming resources at a faster rate. So fixing a bottleneck in the wrong place is more likely to make the process you’re trying to speed up slower, not faster.

      Regarding the “snapshot too old” error: it’s a symptom of either a performance problem or inadequate UNDO retention settings. Resolving it by adding redundant commits is just wrong.

      In general, it’s wrong to make commits more or less frequent for any reason that isn’t related to the business logic. A transaction is a group of actions that have the same meaning and purpose and should fail or succeed together as one. It’s not up to a DBA or database developer to define a transaction; they should simply comply with the business definition of it.

      Best regards,
      Nikolay

      1. Hi Nikolay,

        well, I think the DBA should also suggest a review of what is really critical in the business logic and what is not. It’s not always possible, but it should be done.
        Even with UNDO retention set to 0, you can get “snapshot too old” with huge transactions. If 2 TB of UNDO tablespace is not enough, the logic probably has to be reviewed.
        In fact, using temporary tablespaces reduces both the redo log and the UNDO retention requirements. In many cases, sessions that seldom commit are bound by network speed, and this adds another point of failure.
        Going back to your scenario, it’s clear that the overload moves from the undo to the redo logs, but either way you have an overload, so maybe the HW is undersized. In your post you consider the action of reducing the commits, but that again means reviewing the business logic, so it’s exactly what I suggest. For example, another action would be to consolidate the shorter transactions into one or a few sessions, which would also decrease the sync waits. In this case, you change the client-side implementation of the business logic.
        Best regards

        Cristiano

      1. Hi Jakub,

        because of the piggyback mechanism, there’s no queue for commits, at least not in the traditional sense (although James Morle has a way of describing it as a very special kind of queue, infinitely wide and just two elements deep).

        Best regards,
        Nikolay

  2. Cristiano,

    I’m not going to go into a detailed discussion of properly configuring UNDO settings — it’s not within the scope of this post, or of this blog in general. Let me just recommend a link that should hopefully clarify the subject for you: https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:275215756923 (in particular, it addresses your misconception that frequent commits help avoid ORA-1555 and shows that they are likely to have the opposite effect).

    Regarding my example: the overload didn’t move from undo to redo, because undo was never an issue there. The amount of undo per unit of time was roughly the same in both cases, and very moderate (below 2 MB/s, which is peanuts).

    On the contrary, the redo bottleneck was present in both cases, even though it had a different appearance (in one case it manifested via log file sync, in the other via log buffer space) and scope (in one case it affected committers only, in the other all DML activity). As you can see from the load profile sections of the AWR reports in my post, redo per second is nearly identical in both cases, about 30 MB/s (compare this to 2 MB/s of undo).

    I’m not even sure what you mean by “HW” — usually in Oracle this abbreviation means “high watermark”, but you probably meant something else?

    1. Hi Nikolay,

      HW = Hardware. If the DB slows down the application, then either:

      1) the hardware is insufficient
      2) the application is wrongly designed

      or both, of course.
      Concerning ORA-1555: well, I have seen a statistically significant population of them on my beastie, which is 50 TB now, and I know what to do to cause or avoid them. And that is exactly what I wrote: if you have one (or more) long-running transactions (or cursors) while many others are committing frequently, then you’re easily going to run into “snapshot too old”. But if process A commits more frequently and closes its cursors (i.e. if ALL clients commit frequently), then it’s less likely to occur. There is no absolute notion of “slow commit” and “fast commit”: the scale is always set by the application, and the slowest-committing process sets the scale of “slowness”. One second may be slow or fast, depending on the overall profile.
      In fact, using temporary tables reduces the redo log activity (obviously, large redo writes have a better disk access pattern than small ones), while bulk commits of many rows (instead of individual ones) decrease the overall transaction rate and reduce the likelihood of ORA-1555.
      I would say that an application that commits one row at a time, with, say, 100 ms frequency, while reporting queries or materialized views run for 10 minutes, is just wrongly designed. The DB may need “to take a deep breath” sometimes, and a well-designed application must allow it. No DB should ever have to be time-critical to that extent (unless you deliberately want to scale up the cost of hardware by a factor of 100, of course).
      Best regards

      Cristiano

      1. Cristiano,

        if you believe that there exist situations when it is necessary to commit more frequently to avoid ORA-1555, please send me a reproducible test case. I’ll be happy to analyze it in a separate post.

        Best regards,
        Nikolay
