When I first learned about Active Session History, it was a real game changer for me. It’s (kinda) like tracing that is always on, for every single session… well, every active session, sure, but who cares about idle ones? For a while I got so obsessed with it that I almost stopped using other tools. Fortunately, that was only a short while, because as great as ASH is, you still need other tools. But to this day, ASH remains one of my favorites. However, to fully exploit its potential, you need to visualize its results properly, otherwise they can be misleading, as I intend to show in the rest of this post.
There are a lot of different tools for analyzing OS process states, which can be helpful in resolving non-trivial performance issues. One limitation of such tools is that they are mostly active ones, i.e. you have to do some extra work to collect the desired diagnostic information. This is inconvenient when the problem you’re facing is intermittent and manifests itself on an irregular and unpredictable schedule.
Call stack profiling and flame graphs have been a hot topic in Oracle tech blogs for the last few years, and recently I got a chance to use them to troubleshoot an actual production performance issue. It was quite an interesting journey, with some twists and turns along the way. Let me start by presenting some background for the problem.