When I first learned about Active Session History, it was a real game changer for me. It’s (kinda) like tracing which is always on, for every single session… well, active session, sure, but who cares about idle ones? For a while I got so obsessed with it that I almost stopped using other tools — fortunately, that was a only short while, because as great as ASH is, you still need other tools. But to the day, ASH still remains one of my favorites. However, to fully exploit its potential, you need to properly visualize its results, otherwise they can be misleading, as I intend to show in the rest of this post.
There is a lot of different tools for analyzing OS process states which can be helpful in resolving non-trivial performance issues. One of the limitations of such tools is that they are mostly active ones — i.e. you have to do some extra work to collect the desired diagnostic information. This is inconvenient when the problem you’re facing is intermittent and manifests itself on an irregular and unpredictable schedule.
Call stack profiling and flame graphs have been a hot topic in Oracle tech blogs last few years, and recently I got a chance to use it to troubleshoot an actual production performance issue. It was quite an interesting journey, with some twists and turns along the way. Let me start by presenting some background for the problem.
In my previous article I discussed general questions related to network issues in Data Guard due to packet loss and/or retransmissions. Here I’d like to move to discussing specific tools and methodologies for troubleshooting such issues.
Such tools can be broken down by following criteria:
- server-side or network-side
- active or passive
- level of detail they provide (aggregate statistics or individual packet capture).
I think the first item on the list is more or less self-explanatory: there are tools that can be run on the server (either the sender, i.e. production, or the receiver, i.e. the standby), and there are tools that can be run on the network side. The latter aren’t always accessible to the DBA, but sometimes the data from such tools can be made available by the network team via some sort of a graphic user interface, or by request.
Active and passive in this context means that some tools passively report what’s going on on the network without having to generate any significant traffic of their own, and others have to actively probe the network, causing potentially significant performance overhead. Obviously, the latter are less safe to use, especially on a production environment. However, certain network characteristics, like bandwidth, are hard to measure without the active tools.
Finally, the tools differ in the level of detail they provide: while most only report aggregate statistics about the traffic they monitor, others can capture individual packets within the traffic. Obviously, packet capture provides much more detail, but as usual, it comes at a certain cost: a certain performance overhead, large amount of output data to be stored and processed, and as a consequence, not being able to leave it on for a long time.
Packet capture can be performed both on the server side and on the network side.
Some of the network tools commonly used when troubleshooting TCP performance are:
- netstat, ss and other utilities for displaying network statistics
- various ping-like utilities (the original ping, tcpping etc.)
- traceroute and its alternatives (e.g. mtr)
- various bandwidth benchmarking tools (oratcptest, qperf etc.)
- various sniffers for packet capture (tcpdump or its analogues on the server side, on the network side one typically uses hardware sniffers).
Netstat provides per-protocol statistics when called with the «-s» switch. There are sections for both TCP and IP protocols in there, but often most useful information was found in the TcpExt section, e.g. metrics related to retransmissions:
- fast retransmits
- forward retransmits
- retransmits in slow start.
A nice thing about netstat is that it’s packaged into other monitoring tools like ExaWatcher, so if such a tool is running on your system, you don’t need to set anything up – you can always pull netstat output files as far back as your retention parameters allow. It is of course very straightforward to pull the metrics of interest from the text files and visualize them using some kind of utility – I imagine grep/gnuplot probably being the simplest option, but I personally used R as it allows a lot of flexibility (I am planning to share my R scripts for analyzing netstat output in one of my later posts).
By plotting netstat metrics versus time and comparing the plots with those of other relevant metrics (e.g. average log file sync times) you can potentially draw some interesting conclusions, e.g. whether or not retransmits are likely to account for Data Guard delays, how big they are compared to the overall traffic rate, what kind of retransmit is more common etc. For an experienced Unix network specialist they also give important clues as to how various TCP congestion control mechanisms are performing, hinting which TCP parameters may need some fine-tuning.
What they cannot tell you, however, is whether or not retransmits are «spurious» (i.e. due to packet reordering as opposed to actual packet loss). Packet loss on the network can only be confirmed by either analyzing network-side stats (network devices reporting packet drops on their interfaces) if it’s available, or by analyzing packet capture (more on that later).
Things are different, though, if the packet loss is occurring on the server-side. Netstat output contains the kernel interfaces table, showing for each network interfaces metrics such as TX-OK, TX-ERR, TX-DRP and RX-OK, RX-ERR and RX-DRP, where TX and RX refer to transmission and receiving, correspondingly, while OK, ERR and DRP refer to succesfull transmissions, errors and packet drops.
However, in practice most of the packet loss is occurring on the network and not locally. Still, such metrics can be useful to eliminate local scenarios from the picture.
Ping, traceroute (and analogues)
Ping and traceroute are similar in that they are both simple lightweight tools that provide timings related to network performance between the given source and destination (traceroute additionally gives the breakdown of the overall time by encountered hops).
This makes these tools very attractive. Unfortunately, in addition to their obvious advantages, they are also subject to (much less known) limitations, especially if used without proper switches (so people can end up looking at a wrong interface, or even a wrong protocol subject to different prioritization on the network).
But even if everything is specified correctly, these tools can only provide information about latency, which often is not enough to study performance impact of packet loss (retransmissions) as it primarily affects the throughput.
Oracle offers a utility called oratcptest for measuring TCP latency and throughput. It was designed with Data Guard performance in mind, and allows specifying parameters of the Data Guard traffic such as typical transaction size (redo per transaction).
Like other network benchmarking tools, it measures TCP bandwidth and latency by probing the network by sending messages with a configurable size. As with all active network measurement tools, a considerable downsize is the additional pressure on the network due to the traffic generated by the tool.
When measuring network bandwidth, it is important to keep in mind that it is not a constant, so it is desirable to do the measurements continuously (but with a sufficient interval between the measurements to minimize the performance overhead), including the periods of bad performance. However, if the problem is indeed due to insufficient bandwidth, then taking measurements can make the symptoms much worse.
Therefore, a great care must be taken when using such tools. My preference is to only use them on a UAT setup (if the problem is reproducible on UAT) with sufficient monitoring in place so that it could be turned off quickly enough if there are suspicions of it contributing to the performance degradation.
Packet capture is the source of the most detailed information about the network problems possible. It allows to record all network traffic to files (binary or ASCII) which can then later be analyzed by various tools to spot anomalies.
On Unix platforms, the most commonly used utility for packet capture is tcpdump. By using appropriate switches, you can limit the capture to a specific network interface, or protocol, or destination, or source etc. The amount of data captured can be very significant (tens or hundreds of megabytes per second, or even more), so it is necessary to ensure sufficient free space in the output directory.
Large capture files can be difficult to analyze because analysis tools would typically try and load them to memory, and therefore it is a good idea to keep their size down to a reasonable value (I normally use 1GB). Another handy feature of tcpdump is limiting the number of output files, which would then be overwritten in a circular fashion, preventing from consuming all the free space. Refer to tcpdump online documentation for description of all switches (I’m also planning to cover usage of tcpdump to a greater detail if/when the time allows).
As with all tools, there is a certain amount of performance overhead involved, but from my experience it’s normally quite small and in our case it didn’t cause any particular concern even when using it on live production systems.
Packet capture data can be analysed to obtain global statistics for various aspects of TCP data transmission (including various TCP errors), but it can do much more than just that. For example, you can observe dynamics in time (i.e. view TCP throughput as a function of time), or inspect behaviour of TCP congestion control mechanisms (TCP window scaling).
In addition to analysing network packets statistically, you can work with individual packets, where you can potentially spot patterns that could have eluded the aggregate statistical picture. You can see how much time elapses between sending a packet and receiving an ACK, whether a particular retransmission was triggered by a certain number of duplicate ACKs or a timeout, and many other things.
Such analysis is even more powerful when using packet capture from both ends of the transmission, as it enables you to unambiguously distinguish retransmissions caused by actual packet loss from spurious ones.
Unix provides a large number of tools for troubleshooting network performance. These tools vary greatly in the level of detail they provide. Higher level of detail usually comes at a cost, such as a performance overhead or increased amount of effort required for setting up the tool and/or interpreting its results. The highest level of detail possible is provided by packet capture tools, ideally taken both at the sending and receiving ends at the same time.
In this article I describe the basic mechanics of TCP and DataGuard as well as relevant performance metrics on the database, OS and network sides. The idea is to give DBAs some ammunition in addressing DataGuard performance issues. The most important stage of troubleshooting is the correct identification of the nature of the issue, e.g. being able to tell whether the problem has to do with the network as such, or DataGuard, or Oracle database (primary or standby) or something else. Despite very powerful instrumentation provided by Oracle, it is not an easy task. But even after the network problem has been identified, it doesn’t necessarily stop here for a DBA. You’d think that at that point you’d be able to pass the problem onto a network administrator and wait until it gets resolved, but it doesn’t always work like that. Network issues can be mixed with a range of different ones, but more importantly, network can be a very complex system, so it helps a lot when network people know what exactly to look for. It is equally important for DBAs and SAs to understand the network specialists, because in all but most trivial cases, fixing network issues is an iterative process which requires constant feedback every step of the way. So it really pays for a DBA to speak network administrator’s language so to say.
Last week I participated in Oracle’s Real World Performance event — four days of lectures, quizzes, live demos and hands-on exercises. It was quite interesting, even more so than I expected it to be.
Understandably, a lot of time was spent discussing the perils of row-by-row processing. After all, it was Real World Performance, so it was based on performance problems that the authors of the course faced most often. And many, if not most, performance problems in the real world come from poor coding habits, in particular, from OLTP or object-oriented mindset brought by inexperienced developers into DW world.
Very often, significant performance benefits can be obtained by using some very basic knowledge of the application, its data and business rules. Sometimes even less than that: even if you are not familiar with the application logic at all, you can still use common sense to make some reasonable guesses that would get you a long way in improving query’s performance. Here is an example (based on an actual query that I had to tune today).
Database query tuning is mostly about getting better plans. Mostly, but not always. Sometimes, the problem has nothing to do with the plan, and you might need to get a bit creative to find a solution. In this recent case a query was showing a decent performance when running from SQL Developer, but it took about 5 times longer to complete when running from R. The plan was the same, so I knew that it was irrelevant. The R session wasn’t showing as active most of the time, so it was fairly clear that the problem was fetching data — i.e. it was fetching too few rows at a time which lead to a large number of roundtrips, and consequently, high waits on “idle” event “SQL*Net message from client”.
On one of the databases I’m looking after (18.104.22.168, Solaris, non-RAC), several different INSERT statements (all into tablespaces with manually managed segments) suffer from occasional hiccups. The symptoms are always the same: in one of the sessions, the INSERT gets stuck doing lots of single-block I/O against one of the indexes on the inserted table, and if other sessions are running similar INSERTs, they hang on enq: TX – index contention. The situation can last just a few seconds, but sometimes it’s much longer than this (several minutes), in which case the impact on the application is quite serious.
Imagine the following situation: you are supporting an application with many different components and a busy release cycle. One a Monday morning you find that quite a few processes in the database now run slower. Very soon, you find out that the slowdown is due to increased CPU time, but where to move from there? There is no evidence that CPU is too stressed, causing CPU queuing. You cannot isolate the problem to any specific PL/SQL procedure or SQL id, and there seems to be no relationship between affected SQL statements. You also check the changes that went in the last weekend — there are quite a few of them, but none seems to be particularly relevant. So what do you do?