Finding the root cause of “CPU waits” using stack profiling

Call stack profiling and flame graphs have been a hot topic in Oracle tech blogs last few years, and recently I got a chance to use it to troubleshoot an actual production performance issue. It was quite an interesting journey, with some twists and turns along the way. Let me start by presenting some background for the problem.

Continue reading “Finding the root cause of “CPU waits” using stack profiling” →

Tools for troubleshooting network performance issues

In my previous article I discussed general questions related to network issues in Data Guard due to packet loss and/or retransmissions. Here I’d like to move to discussing specific tools and methodologies for troubleshooting such issues.

Such tools can be broken down by following criteria:

server-side or network-side
active or passive
level of detail they provide (aggregate statistics or individual packet capture).

I think the first item on the list is more or less self-explanatory: there are tools that can be run on the server (either the sender, i.e. production, or the receiver, i.e. the standby), and there are tools that can be run on the network side. The latter aren’t always accessible to the DBA, but sometimes the data from such tools can be made available by the network team via some sort of a graphic user interface, or by request.

Continue reading “Tools for troubleshooting network performance issues” →

Troubleshooting network throughput issues in Oracle Data Guard

Introduction

In this article I describe the basic mechanics of TCP and DataGuard as well as relevant performance metrics on the database, OS and network sides. The idea is to give DBAs some ammunition in addressing DataGuard performance issues. The most important stage of troubleshooting is the correct identification of the nature of the issue, e.g. being able to tell whether the problem has to do with the network as such, or DataGuard, or Oracle database (primary or standby) or something else. Despite very powerful instrumentation provided by Oracle, it is not an easy task. But even after the network problem has been identified, it doesn’t necessarily stop here for a DBA. You’d think that at that point you’d be able to pass the problem onto a network administrator and wait until it gets resolved, but it doesn’t always work like that. Network issues can be mixed with a range of different ones, but more importantly, network can be a very complex system, so it helps a lot when network people know what exactly to look for. It is equally important for DBAs and SAs to understand the network specialists, because in all but most trivial cases, fixing network issues is an iterative process which requires constant feedback every step of the way. So it really pays for a DBA to speak network administrator’s language so to say.

Continue reading “Troubleshooting network throughput issues in Oracle Data Guard” →

Set-based processing

Last week I participated in Oracle’s Real World Performance event — four days of lectures, quizzes, live demos and hands-on exercises. It was quite interesting, even more so than I expected it to be.

Understandably, a lot of time was spent discussing the perils of row-by-row processing. After all, it was Real World Performance, so it was based on performance problems that the authors of the course faced most often. And many, if not most, performance problems in the real world come from poor coding habits, in particular, from OLTP or object-oriented mindset brought by inexperienced developers into DW world.

Continue reading “Set-based processing” →

Dealing with a global increase in CPU usage

Imagine the following situation: you are supporting an application with many different components and a busy release cycle. One a Monday morning you find that quite a few processes in the database now run slower. Very soon, you find out that the slowdown is due to increased CPU time, but where to move from there? There is no evidence that CPU is too stressed, causing CPU queuing. You cannot isolate the problem to any specific PL/SQL procedure or SQL id, and there seems to be no relationship between affected SQL statements. You also check the changes that went in the last weekend — there are quite a few of them, but none seems to be particularly relevant. So what do you do?

Continue reading “Dealing with a global increase in CPU usage” →

Join predicate pushdown

When a query contains a regular or inline view, there are 3 basic strategies for the optimizer to choose from:

1) merge the view (no “VIEW” operation in the plan)
2) instantiate the view as the whole and join it to the rest of the query (the plan shows a VIEW “operation”)
3) push join predicates inside the view (the plan shows “VIEW PUSHED PREDICATE”).

Continue reading “Join predicate pushdown” →

Join cardinality

In my previous post I showed an example of how a query’s performance can be improved using the waste minimization technique. My focus was primarily on identifying and enforcing the correct plan, but I received some questions regarding the root cause of the problem: why the optimizer came up with a wrong join order? It’s a very interesting question, and it deserves a separate post so that it could be explored in detail.

Continue reading “Join cardinality” →

AWR analysis: another case study

A few weeks ago, I received a request to review an AWR report for a database suffering from instance-level performance issues. Here are the the key parts of that report (with some masking):

WORKLOAD REPOSITORY report for

DB Name         DB Id    Instance     Inst Num Release     RAC Host
------------ ----------- ------------ -------- ----------- --- ------------
XXXX           XXXXX     XXXXX               1 10.2.0.5.0  NO  XXXX

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     65115 03-May-16 11:00:09       152      17.8
  End Snap:     65116 03-May-16 12:00:18       152      17.7
   Elapsed:               60.16 (mins)
   DB Time:            2,712.41 (mins)

Cache Sizes
~~~~~~~~~~~                       Begin        End
                             ---------- ----------
               Buffer Cache:     5,856M     5,856M  Std Block Size:         8K
           Shared Pool Size:     2,048M     2,048M      Log Buffer:    14,340K

Load Profile
~~~~~~~~~~~~                            Per Second       Per Transaction
                                   ---------------       ---------------
                  Redo size:          2,455,599.10             14,087.84
              Logical reads:            613,415.60              3,519.18
              Block changes:             12,238.64                 70.21
             Physical reads:             12,233.70                 70.19
            Physical writes:              1,517.54                  8.71
                 User calls:              1,159.19                  6.65
                     Parses:             39,080.15                224.20
                Hard parses:                 32.45                  0.19
                      Sorts:                708.22                  4.06
                     Logons:                  0.31                  0.00
                   Executes:             39,393.06                226.00
               Transactions:                174.31

 ...

Continue reading “AWR analysis: another case study” →

Hints

Oracle cost-based optimizer (CBO) is great, but sometimes it’s making wrong choices even when correct inputs are fed to it. In such cases, you need a tool to override CBOs choices, and one of the most popular tools is optimizer hints. The main reason they’re so popular is that they allow “quick-and-dirty” kind of fixes for performance issues (provided that query text can be altered). Other ways may be more reliable, but generally require more work, and who wants to do work that can be avoided? Unfortunately, there’s a well known downside to the hints — it’s very easy to run into problems if you only fix a part of the plan.

When this happens, hints can lead to terrible execution plans. For a long time, I’ve been looking for a good example to illustrate this problem, and finally this week I encountered a case which appears to be suitable for this purpose.

Continue reading “Hints” →

Workarounds

It’s been forever since I last shared any of my performance troubleshooting experiences at work. This week, I got a case that I think is worth publishing, and I decided to write about it in my blog. So, here we go…

A few days ago, I received a complaint about unstable performance of one of frequently running SQL reports on a 11gR2 database. Most of the time it completed within a couple of minutes, however, on certain occasions it took much longer than that, and once it even took over 20 minutes.

I have a special stored report in SQL developer for conveniently displaying key statistics from DBA_HIST_SQLSTAT which is very helpful as a first step when analyzing unstable SQL performance:
Continue reading “Workarounds” →