Performance monitoring and anomalies

Tech bloggers (myself included) tend to like writing about big issues, with massive impact, company’s reputation (or in extreme cases even very existence) at stake, etc. This adds dramatism to the story (and as an added bonus, helps us feel more important). I think it can also lead to a certain bias, making the IT community believe that only major issues matter. Whenever you spot a relatively harmless anomaly and try to find out more about it on a DBA forum, “is there an actual problem you’re trying to solve” is a very likely response. And if you honestly answer “not really”, you can expect a lecture on “obsessive tuning disorder” and a suggestion to stick to a simple principle, “ain’t broken don’t fix it”. I think this mentality was initially a sound reaction to some inquisitive minds trying to solve non-issues and occasionally creating issues out of nothing instead. When taking too far, however, this attitude becomes dangerous. Anomalies are important even without an immediate impact. Sometimes they are important even without any impact at all. In this post, I’d like to illustrate it with an example.

Before I can start discussing the anomaly in question, I have to talk a little bit about the monitoring framework I used to spot it. It’s based on ExaWatcher output — but instead of using ExaWatcher in form of reports, I asked a Unix SA to create a cron job to generate reports every hour, and then I wrote a set of Python scripts to convert the output files to csv format and push them to a Clickhouse database. To display the data in a convenient format, I built a web UI with R shiny, basically a webpage with a tab per each type of output (one for iostat, one for vmstat, one mpstat etc.), with a graph and a few controls for user inputs.

One morning, I noticed that the iostat tab the data for one of the compute nodes looked something like this:

and a few things about it appeared a little bit odd. For starters, there’s a (nearly) horizontal red line corresponding to the dm-0 device for the RSEC_S (data read in sectors per second, with sector=512 bytes) metric. Converting from blocks to megabytes, it corresponds to about 40 MB/s. Why would it read that much data all the time? And why does it stay so constant — it was either near 80,000 mark or at zero, but not anywhere in between. That was very different from the way pretty much all other metrics on the system behaved, with ebbs and flows marking increases and decreases in user activities and other aspects of the application’s life cycle.

Let me remind me that this is a compute node, not a storage cell where constant reads and writes are to be expected (although it would still have been weird to see as little variation as this). If I were to see any significant I/O there, it probably would have been writes (e.g. writing to trace files), but reads make less sense.

I dug a bit deeper, and it became curiouser and curiouser. The dm-0 device corresponded to the root file system. Why so much I/O against the root file system? I would expect most I/O on a compute node to have to do with various diagnostics — but trace files and listener logs were all residing on separate file systems. Also, other nodes on the cluster showed nothing similar to this behavior, why the difference? The load on the cluster was not perfectly balanced, but it still wasn’t enough to explain this peculiarity.

Eventually I decided to take an even closer look, so I picked a quiet period from the application activity point of view, and collected some perf data with block:block_rq_issue probe, which showed that the activity came from adrci — the command interpreter for the ADR (Automatic Diagnostics Repository). I then raise the issue with the production DBA who was able to determine that the I/O was due to adrci’s attempts to purge old data. As it turned out, whenever file system space usage went over 80%, it started looking for files to purge. However, retention policy limited the choice of files to dispose of to those older than 10 days, and when there wasn’t enough old files (which was the case most of the time), it simply went into an infinite loop, scanning for files to purge, and finding nothing. There were short periods of time when some data would have been deleted, bringing the used space below the threshold — that explained the brief periods of quietness.

The root file systems on compute nodes were increased to twice their original size, and this I/O activity stopped. And it made no difference whatsoever to the server’s performance. Does that mean that all of this was a waste of time? My answer is absolutely not, and let me explain why.

First of all, while in this case there wasn’t any actual impact, there easily could have been. Imagine, for example, that this I/O was not direct, but buffered. At 40MB/s it would only take a few minutes to fill up all free memory with useless data in the page cache. Surely, the pages would still be reclaimable, but other, useful I/O activity, may have been unable to find free pages for buffering. And when the pages would have been eventually reclaimed, there would probably remain enough “holes” in them so that it would become very fragmented, as we have seen with TSM backup.

Perhaps even more importantly, all this I/O generated a lot of noise in which useful signals would drown. If you have enough harmless anomalies polluting the system’s metrics, eventually they’ll end up hiding some harmful anomalies behind them.

There’s also another danger here. Many software bugs operate like binary weapons, when either component is harmless, but their combination is toxic. For example, I had a case where a compute node was experiencing very moderate levels of swapping and slow I/O on the swap device at the same time. On its own, either problem was probably below the detection threshold. Combined, they produced a major incident. By the way when I say “binary”, it’s just for the sake of simplicity, there can be more than two components. For example, in case of memory fragmentation, we can have as many as 4 or 5.

So how does one strike a fine balance between the dangers of neglecting an anomaly, and the dangers of obsessing over every single fluke? It’s not a simple question, but proper performance monitoring goes a long way to help with that.

If you only look at metrics when things go bad, that’s not very helpful. Take AWR reports for example — an isolated report is difficult to interpret. There are hundreds if not thousands of numbers, most of them having no obvious interpretation. 1,035,681 gets on latch X — is it too many, is it too little, is it just right? Pairwise report analysis (either by generating two reports, or by a single AWR diff report) is only a minimal improvement on that. You get two numbers for each metric instead of one. But how does one interpret their difference? Is it a trend, is it a random variation? What kind of scatter is normal for this metric? How much does seasonality factor in — are there any daily, weekly or monthly cycle in application’s and database activity?

The only way to interpret metrics reliably is by observing their dynamics over a significant period of time, and not by picking one or two points. This is true for AWR, this is also true for ExaWatcher, and for pretty much everything.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s