Troubleshooting network throughput issues in Oracle Data Guard

Introduction

In this article I describe the basic mechanics of TCP and DataGuard, as well as the relevant performance metrics on the database, OS and network sides. The idea is to give DBAs some ammunition for addressing DataGuard performance issues. The most important stage of troubleshooting is correctly identifying the nature of the issue, e.g. being able to tell whether the problem has to do with the network as such, or DataGuard, or the Oracle database (primary or standby), or something else. Despite the very powerful instrumentation provided by Oracle, it is not an easy task. But even after a network problem has been identified, the DBA’s job doesn’t necessarily stop there. You’d think that at that point you’d be able to pass the problem on to a network administrator and wait until it gets resolved, but it doesn’t always work like that. Network issues can be mixed with a range of different ones, but more importantly, a network can be a very complex system, so it helps a lot when the network people know exactly what to look for. It is equally important for DBAs and SAs to understand the network specialists, because in all but the most trivial cases, fixing network issues is an iterative process which requires constant feedback every step of the way. So it really pays for a DBA to speak the network administrator’s language, so to speak.

This is why in this article I focus mostly on the network, rather than database, aspects of DataGuard performance troubleshooting.

A couple of caveats at this point: as far as the database side of things is concerned, I have only worked with DataGuard on 11g, so that’s what I’m basing my observations on. Obviously, many things would be different in a different version, but I do believe that the fundamentals change little, so most of it should still be applicable. Also, the article deals mostly with the synchronous replication mode, since DataGuard performance issues aren’t as disruptive when running in the asynchronous mode.

How DataGuard works

I’m going to paint a picture of how DataGuard works with very broad strokes: just the absolute minimum needed to keep this article logically self-contained. For details, please refer to the official DataGuard documentation.

On the primary database, we have user foreground sessions producing redo by making changes to data (i.e. executing DML) and writing it to the log buffer, and the LGWR process writing it from the log buffer to the redo log file. Whenever a database transaction is committed (or the log buffer gets full, whichever happens first), the redo for this transaction needs to be written to the redo log file. But in the case of DataGuard, redo also has to be written to the remote standby. This is done by either the LGWR or the ARC process (I believe for synchronous replication LGWR is the only possibility, but I’m not entirely sure). In reality, however, most of the actual work is done by helper processes, such as NSSn.

On the standby side, the transported redo is written to the standby redo log file by RFS process. Redo is then applied to the standby database by PRn processes. When committing a transaction, LGWR on the primary must receive a confirmation that redo has been written to all destinations (including remote ones), but it is not necessary for it to be applied. Thus DataGuard performance is more sensitive to delays in redo shipment, rather than redo application (although it is possible to come up with a scenario when slow redo apply on the standby could also become a limiting factor for primary database’s performance). Redo transport, however, is not limited to the network transfer, it also includes the local write on the standby.

Note that LGWR process on standby is essentially irrelevant to primary’s performance, since standby is a read-only database (i.e. redo processing on the standby side inevitably revolves around applying redo from the primary as it doesn’t produce its own redo, which means there is little for standby LGWR to do).

Finally, a few words about instrumentation for the above-mentioned processes. On the primary, the usual tools are available: AWR, Active Session History (both the V$/GV$ version and the DBA_HIST one) etc., but on the standby the options are much more limited. The reason is, obviously, the same as above: the standby is a read-only database and cannot have any «actual» data of its own. So e.g. if you run a query against a DBA_HIST view on the standby, you’d simply be querying the primary’s AWR data. X$ tables and the V$/GV$ views based on them are, however, different, so they (first of all, V$ACTIVE_SESSION_HISTORY and GV$ACTIVE_SESSION_HISTORY) become the main source of diagnostic information on the database side. Of course, if the standby database is not open, even that is not available.

OS diagnostic data, such as that collected by OSWatcher (or ExaWatcher on Exadata), can also be very helpful, even more so when the standby is not open.

Symptoms of DataGuard-related delays

When synchronous DataGuard is added to the setup, redo processing becomes much more complex, and dependent on more processes, more branches of internal database (or DataGuard) code and more elements of infrastructure. So there are many more places where things can go wrong or, if we specifically focus on performance, where things can go slow. It’s not just the network: it’s also the DataGuard code (certain things are done differently for a standby database, so you become exposed to a whole new set of bugs), local I/O on the standby, and other things.

So, what do DataGuard-induced delays look like on the primary database? Most commonly, it’s «log file sync» waits. I imagine that they could also manifest themselves via other redo-related waits, such as «log buffer space», but in practice, if the primary database commits frequently enough, it’s going to be primarily «log file sync» waits, since a commit introduces a mandatory synchronization point for redo processing.

It is relatively straightforward to tell DataGuard-related log file sync waits from «local» ones (i.e. originating from some slowness on the primary database itself, like slow local redo I/O): you just look at what LGWR was doing during that time using either AWR, ASH or trace files. If you see DataGuard-related wait events such as «LGWR-LNS wait on channel», it’s the DataGuard, otherwise it’s probably something local.

But narrowing it down to DataGuard is not enough, since there are many very different scenarios within that realm of possibilities, such as:

  • slow network
  • a DataGuard bug
  • slowness on the standby side (e.g. slow I/O).

To narrow it down further, you need to know what other processes (first of all, NSSn and RFS) were doing, using either ASH or other tools.

If the slowness is on the network side, then NSSn would often wait on «LNS wait on SENDREQ», while RFS could be either idle (not even knowing that some data are being sent over the network), or waiting on something like «SQL*Net more data from client» (if it received the first portion of data but it takes long for the rest to arrive). Of course, there may be other possibilities as well. In this article, I will focus on the network scenario for DataGuard performance delays.
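As a rough illustration of this kind of breakdown, the sketch below tallies wait events per process from rows that could have been exported from ASH samples. The function name, the (program, event) row layout and the sample rows are all assumptions made for the example; how each process is labelled in the PROGRAM column may differ on your system.

```python
import re
from collections import Counter, defaultdict


def top_events_by_process(rows):
    """Tally events per process name from (program, event) pairs.

    Hypothetical helper: rows are assumed to come from ASH samples,
    with program values such as 'oracle@host (NSS2)'.
    """
    tally = defaultdict(Counter)
    for program, event in rows:
        # Take the name in the trailing parentheses, dropping a numeric suffix.
        match = re.search(r"\(([A-Z]+)\d*\)\s*$", program)
        name = match.group(1) if match else program
        # In ASH, a sample without an event means the session was on CPU.
        tally[name][event or "ON CPU"] += 1
    return {name: counts.most_common() for name, counts in tally.items()}


# Made-up sample rows, for illustration only.
sample_rows = [
    ("oracle@primary (LGWR)", "LGWR-LNS wait on channel"),
    ("oracle@primary (LGWR)", "LGWR-LNS wait on channel"),
    ("oracle@primary (LGWR)", "log file parallel write"),
    ("oracle@primary (NSS2)", "LNS wait on SENDREQ"),
    ("oracle@primary (NSS2)", None),
]

for name, events in top_events_by_process(sample_rows).items():
    print(name, events)
```

The same tally run on samples taken from the standby would show whether RFS was mostly idle or waiting for more data from the network.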

Network characteristics

From the performance point of view, the most important characteristics of a network are bandwidth, latency and packet loss. These are not really independent of each other, but to have a full picture of network performance, it is necessary to look at all of them. One of the most common sources of confusion with respect to network performance is wide-spread perception that this is something that can be described with a single number (often measured by ping or similar utility). That is far from being true. More likely than not, the single-dimensional approach is going to lead one astray. So let’s consider key characteristics of the network in a bit more detail one by one.

Latency is the delay introduced by the network. It characterizes passage of small messages through the network (typically in both directions — round trip time). Bandwidth is the maximum throughput the network sustains. It describes performance of continuous streams of information. Packet loss, as the name suggests, describes how many packets are lost in the network (typically because of congestion).

For reliable protocols like TCP, packet loss does not mean actual data loss, since all lost packets will eventually be retransmitted. However, the overhead due to detection and re-sending of lost packets affects both the available bandwidth and the effective latency (more about that later).

Two most important things to remember about network performance characteristics:

  • there is more than one
  • they change in time (i.e. a test performed during a quiet period does not necessarily yield numbers that can describe the system at other times)

This needs to be taken into account when analyzing network performance: the characteristics need to be analyzed in their entirety, and the measurements need to be taken while performance degradation is observed, not just at arbitrary moments in time or when it is convenient to do so.

TCP protocol

TCP allows two hosts to exchange continuous ordered streams of data that consist of small packets. Each packet has a number of attributes, such as the position of the packet in the conversation stream (SEQ), length and others. The receiver sends acknowledgement numbers (ACKs) to indicate which packets have been received successfully. An ACK is essentially the highest SEQ that was received to that moment without any gaps. So if host A sends packets 1, 2, 3, 5 and 6 to host B, then host B sends back the following ACKs: 1, 2, 3, 3 and again 3. Packets 5 and 6 remain unack’ed because they come after packet 4, which is missing. So when duplicate ACKs («dupacks») arrive, this is an indication that a packet has possibly been lost and might need a retransmission.

In some cases it is also possible for packets to arrive at the destination out of order (when network paths are parallelized). This will also lead to duplicate ACKs. E.g. if in the example above packet 4 arrived later (after packet 6) rather than getting lost, the ACKs up to the point when it finally arrives would still be the same: 1, 2, 3, 3, 3. When the gap is finally filled, the ACK number increases to 6.
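The ACK behaviour described above can be reproduced with a few lines of Python. This is a minimal sketch of the simplified model used in this article (packets numbered 1, 2, 3, ... and ACKs carrying the highest number received without gaps); real TCP acknowledges byte positions rather than packet numbers, and the function name is made up for the example.

```python
def cumulative_acks(arrivals):
    """Return the ACK sent after each arriving packet.

    Simplified model: an ACK is the highest packet number
    received so far without any gaps.
    """
    received = set()
    highest_in_order = 0
    acks = []
    for seq in arrivals:
        received.add(seq)
        # Advance over every packet that is now contiguous.
        while highest_in_order + 1 in received:
            highest_in_order += 1
        acks.append(highest_in_order)
    return acks


print(cumulative_acks([1, 2, 3, 5, 6]))     # packet 4 is lost
print(cumulative_acks([1, 2, 3, 5, 6, 4]))  # packet 4 arrives late
```

The first call prints [1, 2, 3, 3, 3] and the second prints [1, 2, 3, 3, 3, 6]: until packet 4 shows up, the sender sees exactly the same ACK stream in both cases.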

Since TCP must guarantee that all data is successfully delivered, lost packets need to be retransmitted. There is no way to tell lost packets from those that are simply delayed, so TCP must use various mechanisms to make an educated guess. One way is by using a retransmission timeout (RTO), which changes dynamically but on Linux is at least 200 ms. That can be a long time, especially for a low-latency high-speed network. To improve performance, TCP also tries to recover from packet loss using duplicate ACKs, which will normally be faster (hence the term «fast retransmit»).
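To make the duplicate-ACK mechanism a bit more concrete, here is a minimal sketch of how a sender could decide when to fast-retransmit. It assumes the conventional threshold of three duplicate ACKs (the value used in RFC 5681); the function name is hypothetical, and real TCP stacks apply further refinements that are not modelled here.

```python
DUPACK_THRESHOLD = 3  # conventional value (RFC 5681); an assumption here


def fast_retransmit_points(acks, threshold=DUPACK_THRESHOLD):
    """Return the ACK numbers at which a fast retransmit would be triggered."""
    triggers = []
    last_ack = None
    duplicates = 0
    for ack in acks:
        if ack == last_ack:
            duplicates += 1
            # Fire once, at the moment the threshold is reached.
            if duplicates == threshold:
                triggers.append(ack)
        else:
            last_ack = ack
            duplicates = 0
    return triggers


print(fast_retransmit_points([1, 2, 3, 3, 3]))     # two duplicates only
print(fast_retransmit_points([1, 2, 3, 3, 3, 3]))  # third duplicate arrives
```

With only two duplicates of ACK 3 nothing is triggered yet; once the third duplicate arrives, the sender would retransmit the packet following number 3 without waiting for the RTO to expire.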

Retransmits are very important because of congestion control in TCP. The idea is that excessive traffic can cause congestion in one or more network devices, so when TCP sees signs that can be interpreted as packet loss, it decides to play safe and reduces the rate of data transmission. This is done by controlling the size of the so-called TCP congestion window, which is basically the number of packets in flight, i.e. how many non-ACKed packets there can be at a given moment. With each «congestion event», the size of the TCP congestion window is reduced. With each consecutive ACK, the size of the congestion window increases. Therefore, the frequency of congestion events determines the effective throughput level. For example, if the average size of the congestion window is 10 full-sized packets (1500 bytes each, or 1460 bytes of payload without the header info), and the round trip time is 5 ms, then the transmission will be limited to 14,600 bytes per 5 ms, or 2,920,000 bytes per second.
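The arithmetic in this example is simple enough to wrap in a helper, which makes it easy to try other window sizes and round trip times. This is just a sketch of the back-of-the-envelope estimate above, with a made-up function name; it ignores everything except the congestion window and the round trip time.

```python
def window_limited_throughput(cwnd_packets, payload_bytes, rtt_seconds):
    """Bytes per second when one full window is delivered per round trip."""
    return cwnd_packets * payload_bytes / rtt_seconds


# The example from the text: 10 packets of 1460 payload bytes, 5 ms RTT.
rate = window_limited_throughput(10, 1460, 0.005)
print(f"{rate:,.0f} bytes per second")
```

Halving the window or doubling the round trip time halves the estimate, which is why the same loss pattern hurts more on a longer link.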

The details of how the congestion window starts growing again after a congestion event depend both on the nature of the event and on the TCP congestion control algorithm in use.

It has been shown by Mathis that regardless of those details, the upper limit for TCP throughput in the presence of packet loss is given by the following formula:

MSS/RTT/sqrt(p),

where RTT is the round trip time and p is the packet loss rate expressed as a fraction. So in a network with MSS=1500 bytes and RTT=1ms, a packet loss rate of 0.01 (i.e. 1%) means that TCP throughput would be limited to about 15 MB/s (1500 bytes / 0.001 s / sqrt(0.01) = 15,000,000 bytes per second).
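To get a feel for how quickly this bound drops as the loss rate grows, the formula can be evaluated for a few values. The sketch below is an assumption-laden helper rather than a reference implementation: the function name is made up, and the `constant` parameter is my addition, since the published model also carries a constant of order one that the formula above leaves out (it defaults to 1 to match the formula as written here).

```python
import math


def mathis_throughput(mss_bytes, rtt_seconds, loss_rate, constant=1.0):
    """Upper bound on TCP throughput in bytes per second (Mathis model)."""
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be a fraction in (0, 1]")
    return constant * mss_bytes / (rtt_seconds * math.sqrt(loss_rate))


# Illustrative inputs: 1460-byte payload, 5 ms round trip time.
for loss in (0.0001, 0.001, 0.01):
    limit = mathis_throughput(1460, 0.005, loss)
    print(f"loss {loss:.2%}: at most {limit / 1e6:.2f} MB/s")
```

Because the loss rate sits under a square root, a hundredfold increase in loss cuts the bound by a factor of ten.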

Packet loss on a particular connection isn’t always easy to find. One way would be to ask your network administrators for the readings of packet drop counters on the devices in the path of the DataGuard traffic. Unfortunately, these counters can be misleading. It is also possible to obtain this information from netstat output (part of ExaWatcher if you’re on Exadata). However, there is no breakdown by network interface or other criteria, so if retransmissions also occur in other traffic, netstat alone cannot tell you how much of the problem belongs to DataGuard.
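As a rough sketch of the netstat approach, the snippet below compares two snapshots of `netstat -s` output and estimates the share of retransmitted segments in between. It assumes the Linux wording of the TCP counters («segments send out» / «segments sent out» and «segments retransmited» / «segments retransmitted», depending on the net-tools version); the function names and sample snapshots are made up, and the result is host-wide, not specific to DataGuard traffic.

```python
import re


def tcp_segment_counters(netstat_s_output):
    """Extract (segments sent, segments retransmitted) from `netstat -s` text."""
    # Both spellings are accepted because the wording varies between versions.
    sent = re.search(r"(\d+) segments sen[dt] out", netstat_s_output)
    retrans = re.search(r"(\d+) segments retransmit+ed", netstat_s_output)
    if not sent or not retrans:
        raise ValueError("TCP segment counters not found")
    return int(sent.group(1)), int(retrans.group(1))


def retransmission_rate(before, after):
    """Retransmitted segments per segment sent between two snapshots."""
    sent_0, retrans_0 = tcp_segment_counters(before)
    sent_1, retrans_1 = tcp_segment_counters(after)
    sent_delta = sent_1 - sent_0
    if sent_delta <= 0:
        return 0.0
    return (retrans_1 - retrans_0) / sent_delta


# Made-up snapshots, for illustration only.
snapshot_1 = "    1000 segments send out\n    10 segments retransmited\n"
snapshot_2 = "    2000 segments sent out\n    30 segments retransmitted\n"
print(retransmission_rate(snapshot_1, snapshot_2))
```

Taking the difference between two snapshots matters because the counters are cumulative since boot, so a single reading says little about the period when the slowdown was observed.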

A more reliable way is to do a packet capture with a tool like tcpdump, since with tcpdump you can apply numerous filters (by source, destination, interface etc.) to look at just the traffic you want.

Single-side packet capture can only show retransmissions, but doesn’t give one enough information to determine whether they are «normal» ones coming from genuine packet loss, or spurious ones coming from packet reordering or late arrival. In order to get the full picture, it is necessary to collect tcpdump on both sides, and then either join the data sets (which gives more accurate results, but requires much more work, and also is quite demanding in terms of computational resources), or carefully compare the aggregates on both sides (i.e. look at multiple appearances of same packets on send and receive sides).
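The aggregate comparison can be sketched as follows, assuming the sequence numbers of the relevant packets have already been extracted from the two captures into plain lists. The function name and the two labels are hypothetical, and the logic is deliberately crude: a packet that was sent more than once counts as «spurious» if every copy reached the receiver, and as «lost» if at least one copy went missing.

```python
from collections import Counter


def classify_retransmissions(sent_seqs, received_seqs):
    """Label each retransmitted sequence number as 'spurious' or 'lost'."""
    sent = Counter(sent_seqs)
    received = Counter(received_seqs)
    labels = {}
    for seq, times_sent in sent.items():
        if times_sent < 2:
            continue  # never retransmitted, nothing to classify
        if received.get(seq, 0) >= times_sent:
            labels[seq] = "spurious"  # every copy arrived
        else:
            labels[seq] = "lost"      # at least one copy is missing
    return labels


# Made-up data: packets 4 and 5 were each sent twice.
sender_side = [1, 2, 3, 4, 4, 5, 5]
receiver_side = [1, 2, 3, 5, 4, 5]
print(classify_retransmissions(sender_side, receiver_side))
```

Counting alone cannot tell which copy of a packet was dropped, so a «lost» label here only says that some copy failed to arrive; joining the two captures packet by packet is what resolves that ambiguity.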

I am planning to cover that in a separate article, so this is it for now. Let me summarize the main points:

  • In synchronous DataGuard replication mode, performance delays on primary due to slow shipment of redo over the network are possible (and even likely)
  • Performance of networks is primarily described in terms of bandwidth, latency and packet loss
  • For redo shipment, the main concern is network bandwidth, which must match the peak rate of redo generation on the primary
  • DataGuard normally uses TCP protocol (it may be possible to use other protocols, but I haven’t seen such cases in practice). For TCP, packet loss limits the bandwidth via so-called congestion control.
  • Other factors, like packet reordering, can mimic packet loss in the eyes of TCP congestion control algorithms, and therefore can also affect TCP throughput
  • Quantitatively, the impact of packet loss (or spurious retransmissions) on TCP throughput can be estimated using the Mathis model
  • Therefore in order to know whether your network bandwidth is affected by packet loss or other congestion events, you need to keep track of the rate of packet retransmissions — ideally, separately for various scenarios
  • There are various sources of diagnostic information for observing congestion-like symptoms in network traffic, such as counters on the interfaces of the network switches, netstat output in ExaWatcher/OSWatcher etc.
  • But the most complete and accurate picture for network congestion and congestion-like events can be obtained by collecting and analyzing packet capture on both sides.
