Tools for troubleshooting network performance issues

In my previous article I discussed general questions related to network issues in Data Guard due to packet loss and/or retransmissions. Here I’d like to move to discussing specific tools and methodologies for troubleshooting such issues.

Such tools can be broken down by following criteria:

server-side or network-side
active or passive
level of detail they provide (aggregate statistics or individual packet capture).

I think the first item on the list is more or less self-explanatory: there are tools that can be run on the server (either the sender, i.e. production, or the receiver, i.e. the standby), and there are tools that can be run on the network side. The latter aren’t always accessible to the DBA, but sometimes the data from such tools can be made available by the network team via some sort of a graphic user interface, or by request.

Active and passive in this context means that some tools passively report what’s going on on the network without having to generate any significant traffic of their own, and others have to actively probe the network, causing potentially significant performance overhead. Obviously, the latter are less safe to use, especially on a production environment. However, certain network characteristics, like bandwidth, are hard to measure without the active tools.

Finally, the tools differ in the level of detail they provide: while most only report aggregate statistics about the traffic they monitor, others can capture individual packets within the traffic. Obviously, packet capture provides much more detail, but as usual, it comes at a certain cost: a certain performance overhead, large amount of output data to be stored and processed, and as a consequence, not being able to leave it on for a long time.

Packet capture can be performed both on the server side and on the network side.

Some of the network tools commonly used when troubleshooting TCP performance are:

netstat, ss and other utilities for displaying network statistics
various ping-like utilities (the original ping, tcpping etc.)
traceroute and its alternatives (e.g. mtr)
various bandwidth benchmarking tools (oratcptest, qperf etc.)
various sniffers for packet capture (tcpdump or its analogues on the server side, on the network side one typically uses hardware sniffers).

Netstat

Netstat provides per-protocol statistics when called with the «-s» switch. There are sections for both TCP and IP protocols in there, but often most useful information was found in the TcpExt section, e.g. metrics related to retransmissions:

fast retransmits
forward retransmits
retransmits in slow start.

A nice thing about netstat is that it’s packaged into other monitoring tools like ExaWatcher, so if such a tool is running on your system, you don’t need to set anything up – you can always pull netstat output files as far back as your retention parameters allow. It is of course very straightforward to pull the metrics of interest from the text files and visualize them using some kind of utility – I imagine grep/gnuplot probably being the simplest option, but I personally used R as it allows a lot of flexibility (I am planning to share my R scripts for analyzing netstat output in one of my later posts).

By plotting netstat metrics versus time and comparing the plots with those of other relevant metrics (e.g. average log file sync times) you can potentially draw some interesting conclusions, e.g. whether or not retransmits are likely to account for Data Guard delays, how big they are compared to the overall traffic rate, what kind of retransmit is more common etc. For an experienced Unix network specialist they also give important clues as to how various TCP congestion control mechanisms are performing, hinting which TCP parameters may need some fine-tuning.

What they cannot tell you, however, is whether or not retransmits are «spurious» (i.e. due to packet reordering as opposed to actual packet loss). Packet loss on the network can only be confirmed by either analyzing network-side stats (network devices reporting packet drops on their interfaces) if it’s available, or by analyzing packet capture (more on that later).

Things are different, though, if the packet loss is occurring on the server-side. Netstat output contains the kernel interfaces table, showing for each network interfaces metrics such as TX-OK, TX-ERR, TX-DRP and RX-OK, RX-ERR and RX-DRP, where TX and RX refer to transmission and receiving, correspondingly, while OK, ERR and DRP refer to succesfull transmissions, errors and packet drops.

However, in practice most of the packet loss is occurring on the network and not locally. Still, such metrics can be useful to eliminate local scenarios from the picture.

Ping, traceroute (and analogues)

Ping and traceroute are similar in that they are both simple lightweight tools that provide timings related to network performance between the given source and destination (traceroute additionally gives the breakdown of the overall time by encountered hops).

This makes these tools very attractive. Unfortunately, in addition to their obvious advantages, they are also subject to (much less known) limitations, especially if used without proper switches (so people can end up looking at a wrong interface, or even a wrong protocol subject to different prioritization on the network).

But even if everything is specified correctly, these tools can only provide information about latency, which often is not enough to study performance impact of packet loss (retransmissions) as it primarily affects the throughput.

Microbenchmarking tools

Oracle offers a utility called oratcptest for measuring TCP latency and throughput. It was designed with Data Guard performance in mind, and allows specifying parameters of the Data Guard traffic such as typical transaction size (redo per transaction).

Like other network benchmarking tools, it measures TCP bandwidth and latency by probing the network by sending messages with a configurable size. As with all active network measurement tools, a considerable downsize is the additional pressure on the network due to the traffic generated by the tool.

When measuring network bandwidth, it is important to keep in mind that it is not a constant, so it is desirable to do the measurements continuously (but with a sufficient interval between the measurements to minimize the performance overhead), including the periods of bad performance. However, if the problem is indeed due to insufficient bandwidth, then taking measurements can make the symptoms much worse.

Therefore, a great care must be taken when using such tools. My preference is to only use them on a UAT setup (if the problem is reproducible on UAT) with sufficient monitoring in place so that it could be turned off quickly enough if there are suspicions of it contributing to the performance degradation.

Packet capture

Packet capture is the source of the most detailed information about the network problems possible. It allows to record all network traffic to files (binary or ASCII) which can then later be analyzed by various tools to spot anomalies.

On Unix platforms, the most commonly used utility for packet capture is tcpdump. By using appropriate switches, you can limit the capture to a specific network interface, or protocol, or destination, or source etc. The amount of data captured can be very significant (tens or hundreds of megabytes per second, or even more), so it is necessary to ensure sufficient free space in the output directory.

Large capture files can be difficult to analyze because analysis tools would typically try and load them to memory, and therefore it is a good idea to keep their size down to a reasonable value (I normally use 1GB). Another handy feature of tcpdump is limiting the number of output files, which would then be overwritten in a circular fashion, preventing from consuming all the free space. Refer to tcpdump online documentation for description of all switches (I’m also planning to cover usage of tcpdump to a greater detail if/when the time allows).

As with all tools, there is a certain amount of performance overhead involved, but from my experience it’s normally quite small and in our case it didn’t cause any particular concern even when using it on live production systems.

Packet capture data can be analysed to obtain global statistics for various aspects of TCP data transmission (including various TCP errors), but it can do much more than just that. For example, you can observe dynamics in time (i.e. view TCP throughput as a function of time), or inspect behaviour of TCP congestion control mechanisms (TCP window scaling).

In addition to analysing network packets statistically, you can work with individual packets, where you can potentially spot patterns that could have eluded the aggregate statistical picture. You can see how much time elapses between sending a packet and receiving an ACK, whether a particular retransmission was triggered by a certain number of duplicate ACKs or a timeout, and many other things.

Such analysis is even more powerful when using packet capture from both ends of the transmission, as it enables you to unambiguously distinguish retransmissions caused by actual packet loss from spurious ones.

Summary

Unix provides a large number of tools for troubleshooting network performance. These tools vary greatly in the level of detail they provide. Higher level of detail usually comes at a cost, such as a performance overhead or increased amount of effort required for setting up the tool and/or interpreting its results. The highest level of detail possible is provided by packet capture tools, ideally taken both at the sending and receiving ends at the same time.