In my previous posts (e.g. here and here) I showed how visualizing ps output (e.g. from ExaWatcher) can help spot performance problems in Linux. Here I’d like to show that this approach can be taken a little further, namely, to find the source of an increase in memory usage.
The R code for this is quite straightforward. I also think it shouldn’t be much of a problem to do the same in Python, although I haven’t gotten around to trying it myself. In the code below, read_ps is the function that reads in a ps output file from ExaWatcher without unzipping it, sum_and_tidy does the aggregation, and visualize_rss does the plotting. There is also an auxiliary function, keep_top_n, which keeps the number of color bands reasonable.
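The R functions themselves are not reproduced in this excerpt. For the Python route mentioned above, a minimal sketch of the aggregation step alone might look like the following. It is not a port of the original code: the function names are borrowed from the description, their behavior is guessed from it, and the input is assumed to be already parsed into hypothetical (timestamp, command, rss_kb) tuples, so reading the actual ExaWatcher file format is left out. Ranking commands by peak RSS is also an assumption.

```python
from collections import defaultdict


def sum_and_tidy(samples):
    """Sum RSS over all processes sharing a (timestamp, command) pair.

    `samples` is an iterable of (timestamp, command, rss_kb) tuples;
    this layout is assumed for illustration, not taken from ExaWatcher.
    """
    totals = defaultdict(int)
    for ts, command, rss_kb in samples:
        totals[(ts, command)] += rss_kb
    return dict(totals)


def keep_top_n(totals, n):
    """Keep the n commands with the highest peak RSS, fold the rest into 'other'."""
    peak = defaultdict(int)
    for (ts, command), rss in totals.items():
        peak[command] = max(peak[command], rss)
    # Sort by descending peak, then by name so that ties are deterministic.
    top = set(sorted(peak, key=lambda c: (-peak[c], c))[:n])
    folded = defaultdict(int)
    for (ts, command), rss in totals.items():
        key = command if command in top else "other"
        folded[(ts, key)] += rss
    return dict(folded)


# Made-up sample data: two snapshots, several processes each.
samples = [
    ("10:00", "oracle", 500), ("10:00", "oracle", 700),
    ("10:00", "java", 300), ("10:00", "sshd", 10),
    ("10:05", "oracle", 1500), ("10:05", "java", 320),
    ("10:05", "sshd", 12), ("10:05", "bash", 5),
]
tidy = keep_top_n(sum_and_tidy(samples), n=2)
for (ts, command), rss in sorted(tidy.items()):
    print(ts, command, rss)
```

The resulting dictionary has one entry per timestamp and color band, which is the shape a stacked area plot needs.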
Continue reading “Where did my RAM go?”
In my last post, where I analyzed the problems caused by memory fragmentation on a Linux server, I said very little about memory fragmentation itself. I wanted to tell a story, so I had to dial down the technical stuff. But now that the story is told, I figured I should make a more technical post on the subject. I think it might be useful for many Linux DBAs or SAs, since there is not enough awareness of this problem within the Linux community. So if your system is experiencing cryptic performance and stability issues, keep reading: your problem may well stem from memory fragmentation.
Continue reading “Memory fragmentation: the silent performance killer”
Every upgrade is a bit of a lottery. But for a long-running, well-established system with a lot of legacy code, it can be more of a Russian roulette. You are unlikely to gain much: your system has been running fine, thanks to various tweaks the apps team applied here and there over the years. As for all the nice and shiny features that arrive with the upgrade, sure, there can be a few that you might like and have big plans for, but most of them are not going to be very useful, at least not out of the box. And there’s a good chance that some of them will backfire in a really, really bad way.
Continue reading “How to hang a server with a single ping, and other fun things we learned in a 18c upgrade”
Classic symptoms of memory pressure (free physical memory running low + swapping) are often more difficult to interpret than they seem, especially on modern enterprise grade servers. In this article, I attempt to bring some clarity to the issue. It is based on Linux, although many observations can be generalized to other Unix-like operating systems.
Continue reading “Swapping (paging) for DBAs”
When I first learned about Active Session History, it was a real game changer for me. It’s (kinda) like tracing which is always on, for every single session… well, active session, sure, but who cares about idle ones? For a while I got so obsessed with it that I almost stopped using other tools. Fortunately, that was only a short while, because as great as ASH is, you still need other tools. But to this day, ASH remains one of my favorites. However, to fully exploit its potential, you need to properly visualize its results, otherwise they can be misleading, as I intend to show in the rest of this post.
Continue reading “A picture paints a thousand words”
There are a lot of different tools for analyzing OS process states, which can be helpful in resolving non-trivial performance issues. One of the limitations of such tools is that they are mostly active ones, i.e. you have to do some extra work to collect the desired diagnostic information. This is inconvenient when the problem you’re facing is intermittent and manifests itself on an irregular and unpredictable schedule.
Continue reading “ASH for OS processes”
Call stack profiling and flame graphs have been a hot topic in Oracle tech blogs in the last few years, and recently I got a chance to use them to troubleshoot an actual production performance issue. It was quite an interesting journey, with some twists and turns along the way. Let me start by presenting some background for the problem.
Continue reading “Finding the root cause of “CPU waits” using stack profiling”
In my previous article I discussed general questions related to network issues in Data Guard due to packet loss and/or retransmissions. Here I’d like to move to discussing specific tools and methodologies for troubleshooting such issues.
Such tools can be broken down by the following criteria:
- server-side or network-side
- active or passive
- level of detail they provide (aggregate statistics or individual packet capture).
I think the first item on the list is more or less self-explanatory: there are tools that can be run on the server (either the sender, i.e. production, or the receiver, i.e. the standby), and there are tools that can be run on the network side. The latter aren’t always accessible to the DBA, but sometimes the data from such tools can be made available by the network team via some sort of graphical user interface, or by request.
Continue reading “Tools for troubleshooting network performance issues”
In this article I describe the basic mechanics of TCP and Data Guard, as well as relevant performance metrics on the database, OS and network sides. The idea is to give DBAs some ammunition in addressing Data Guard performance issues. The most important stage of troubleshooting is the correct identification of the nature of the issue, e.g. being able to tell whether the problem has to do with the network as such, or Data Guard, or the Oracle database (primary or standby), or something else. Despite the very powerful instrumentation provided by Oracle, this is not an easy task. But even after a network problem has been identified, it doesn’t necessarily stop there for a DBA. You’d think that at that point you’d be able to pass the problem on to a network administrator and wait until it gets resolved, but it doesn’t always work like that. Network issues can be mixed with a range of other issues, but more importantly, a network can be a very complex system, so it helps a lot when the network people know exactly what to look for. It is equally important for DBAs and SAs to understand the network specialists, because in all but the most trivial cases, fixing network issues is an iterative process which requires constant feedback every step of the way. So it really pays for a DBA to speak the network administrator’s language.
Continue reading “Troubleshooting network throughput issues in Oracle Data Guard”
Last week I participated in Oracle’s Real World Performance event — four days of lectures, quizzes, live demos and hands-on exercises. It was quite interesting, even more so than I expected it to be.
Understandably, a lot of time was spent discussing the perils of row-by-row processing. After all, it was Real World Performance, so it was based on the performance problems that the authors of the course faced most often. And many, if not most, performance problems in the real world come from poor coding habits, in particular, from an OLTP or object-oriented mindset brought by inexperienced developers into the DW world.
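To make the contrast concrete, here is a small sketch of the two styles. It is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for an Oracle database, and the orders table, its columns and the 10% discount are all made up for the example rather than taken from the course.

```python
import sqlite3

# In-memory SQLite database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, amount_cents INTEGER, discounted_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (id, amount_cents) VALUES (?, ?)",
    [(i, 1000 * i) for i in range(1, 6)],
)


def discount_row_by_row(conn):
    """Fetch every row into the client, then issue one UPDATE per row."""
    rows = conn.execute("SELECT id, amount_cents FROM orders").fetchall()
    for order_id, amount in rows:
        conn.execute(
            "UPDATE orders SET discounted_cents = ? WHERE id = ?",
            (amount * 9 // 10, order_id),
        )
    return len(rows)  # number of UPDATE statements issued


def discount_set_based(conn):
    """Let the database apply the same rule to all rows in one statement."""
    cur = conn.execute("UPDATE orders SET discounted_cents = amount_cents * 9 / 10")
    return cur.rowcount  # number of rows modified by the single statement


query = "SELECT id, discounted_cents FROM orders ORDER BY id"

statements_issued = discount_row_by_row(conn)
row_result = conn.execute(query).fetchall()

conn.execute("UPDATE orders SET discounted_cents = NULL")  # reset between runs

rows_touched = discount_set_based(conn)
set_result = conn.execute(query).fetchall()

print(statements_issued, "statements vs 1 statement touching", rows_touched, "rows")
print(row_result == set_result)
```

Both versions leave the table in the same state, but the first issues one statement per row while the second issues a single statement for the whole set. Against a real database over a network, each of those per-row statements would typically also cost a round trip.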
Continue reading “Set-based processing”