There is a lot of different tools for analyzing OS process states which can be helpful in resolving non-trivial performance issues. One of the limitations of such tools is that they are mostly active ones — i.e. you have to do some extra work to collect the desired diagnostic information. This is inconvenient when the problem you’re facing is intermittent and manifests itself on an irregular and unpredictable schedule.
However, there are some great passive monitoring tools which can provide you with information about issues that happened in recent past at no additional cost. This is especially true for RAC setups where you normally have ExaWatcher (or OSWatcher for non-Exadata RACs) constantly running in the background and gathering OS-level information with high frequency.
In particular, you have ps snapshots which looks like this (caveat — it looks that way on Linux, I did not check how it looks on other OS’s!):
############################################################ # Starting Time: 03/27/2019 16:19:23 # Sample Interval(s): 5 # Archive Count: 720 # Collection Module: PsExaWatcher # Collection Command: /opt/oracle.ExaWatcher/FlexIntervalMode.sh '/bin/ps -eo flags,s,ruser,pid,ppid,c,psr,pri,ni,addr,rss,sz,wchan=WIDE-WCHAN-COLUMN,stime,tty,time,cmd' 5 720 196 ############################################################ zzz Count:720 [INFO] Flexible sampling model enabled. The sampling interval and count is subject to change due to system workload. zzz subcount: 0 F S RUSER PID PPID C PSR PRI NI ADDR RSS SZ WIDE-WCHAN-COLUMN STIME TT TIME CMD 4 S root 1 0 0 16 19 0 - 2020 4855 poll_schedule_tim 2018 ? 17:59:06 /sbin/init 1 S root 2 0 0 4 19 0 - 0 0 kthreadd 2018 ? 00:00:05 [kthreadd] 1 S root 3 2 0 0 19 0 - 0 0 smpboot_thread_fn 2018 ? 01:30:34 [ksoftirqd/0] ...
As you can see, there is some useful information in there, and first of all, WCHAN, which is the wait channel, i.e. the kernel function on which the process is currently sleeping. If your database processes get stuck because of some weird kernel locks, then the chances are, the database instrumentation is not going to tell you much, and this information in ps output can be of tremendous value (for example, it’s quite easy to see how it could have been used to diagnose the NUMA issue I described in my previous post).
In a way, WCHAN is like the wait event for the OS process, so ps output is pretty much like ASH, with samples of process states taken every 5 seconds! Of course, unlike ASH, this is just regular text which requires a bit of processing before it can be filtered and aggregated, but it’s not too difficult.
For example, I used this simple R script below to do the processing:
dir= "C:\\mydata" fname = "2019_03_27_16_19_23_PsExaWatcher_xxxxxxxx.xxx.xxx.com.dat.bz2" setwd(dir) con=bzfile(fname) lines <- readLines(con ) # identify lines that contain timestamps tstamp.idx <- grep('^ zzz .*$', lines) # identify lines that contain data data.idx <- grep('^(\\S*\\s){16}.*$', lines) # match data entries to timestamp entries i <- findInterval(data.idx, tstamp.idx) # parse data entries s <- str_match(lines[data.idx], pattern="^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(.*)$") # remove the first entry which is the header s <- s[,-1] # convert to a data frame df <- as.data.frame(s, stringsAsFactors=F) # get the timestamp entries and parse them using regex t <- lines[tstamp.idx[i]] tstamp <- gsub(pattern='^\\s*zzz subcount: \\d*$', x=t, replacement='\\1') # give meaningful names to the columns colnames(df) <- c('F', 'S', 'RUSER', 'PID', 'PPID', 'C', 'PSR', 'PRI', 'NI', 'ADDR', 'RSS', 'SZ', 'WCHAN', 'STIME', 'TT', 'TIME', 'CMD') # add the timestamp df <- cbind(df[-1,],tstamp) df <- df[-1,] df % mutate(tstamp=mdy_hms(tstamp)) # filter Oracle process and remove those sleeping on sk_wait_data d % filter(grepl(CMD, pattern="racle") & !grepl(WCHAN, pattern='sk_wait_data')) %>% group_by(WCHAN, tstamp) %>% summarize(x=n()) # take care of NA issue that leads to ugly gaps in the area graph d2 % tidyr::spread(key = WCHAN, value = x, fill = 0) %>% tidyr::gather(key = WCHAN, value = x, - tstamp) %>% arrange(tstamp, WCHAN) # produce the plot and call ggplotly to enrich it with interactive features p % ggplot(aes(x=tstamp,y=x,fill=WCHAN)) ggplotly(p+geom_area())
and then you can slice and dice the data any way you like, and even produce nice OEM-style graphs to visualize the wait channels for the subset of processes of interest:
It is also possible to parse these files using external tables using techniques for trace file analysis I described in my blog earlier which would allow you to query the contents with plain SQL same way you would with any data table.
Of course one of the big reasons to do trace file analysis on the database side is being able to do the table join with other data. In this particular case it’s not going to be always useful, because it is difficult to link processes to sessions. E.g. ASH doesn’t give you server pid’s (which is a shame), but there are still things that you can do, at least in some cases. For example, for batch jobs or background processes, the OS process can be identified using the CMD column.
So I think this tool can be a useful addition to active OS-side tools used in database performance analysis (like psnapper recently release by Tanel Poder). I would imagine it being even more useful for less thoroughly instrumented database management systems (like mysql), but I myself don’t do much work in that area so I haven’t tried it yet.
Hi Savvinov,
Thanks for the tool.
I tried to run the R program from the console and I am getting the below errors..
> dir= “C:\users\xxxxxx\Desktop\Logs\Files”
Error: ‘\u’ used without hex digits in character string starting “”C:\u”
> fname = “xxxxxxxx.04.19.2300.dat”
> setwd(dir)
Error in setwd(dir) : character argument expected
> con=bzfile(fname)
> lines
> # identify lines that contain timestamps
> tstamp.idx # identify lines that contain data
> data.idx # match data entries to timestamp entries
> i # parse data entries
> s # remove the first entry which is the header
> s # convert to a data frame
> df # get the timestamp entries and parse them using regex
> t tstamp # give meaningful names to the columns
> colnames(df) <- c('F', 'S', 'RUSER', 'PID', 'PPID', 'C', 'PSR', 'PRI', 'NI', 'ADDR', 'RSS', 'SZ', 'WCHAN', 'STIME', 'TT', 'TIME', 'CMD')
Error in `colnames # add the timestamp
> df df df % mutate(tstamp=mdy_hms(tstamp))
Error: unexpected input in “df % mutate(tstamp=mdy_hms(tstamp))”
>
> # filter Oracle process and remove those sleeping on sk_wait_data
> d % filter(grepl(CMD, pattern=”racle”) & !grepl(WCHAN, pattern=’sk_wait_data’)) %>% group_by(WCHAN, tstamp) %>% summarize(x=n())
Error: unexpected ‘>’ in “d % filter(grepl(CMD, pattern=”racle”) & !grepl(WCHAN, pattern=’sk_wait_data’)) %>”
>
> # take care of NA issue that leads to ugly gaps in the area graph
> d2 %
Error: unexpected input in “d2 %”
> tidyr::spread(key = WCHAN, value = x, fill = 0) %>%
+ tidyr::gather(key = WCHAN, value = x, – tstamp) %>%
+ arrange(tstamp, WCHAN)
Error in tidyr::spread(key = WCHAN, value = x, fill = 0) %>% tidyr::gather(key = WCHAN, :
could not find function “%>%”
>
> # produce the plot and call ggplotly to enrich it with interactive features
> p % ggplot(aes(x=tstamp,y=x,fill=WCHAN))
Error: unexpected input in “p % ggplot(aes(x=tstamp,y=x,fill=WCHAN))”
> ggplotly(p+geom_area())
Error in ggplotly(p + geom_area()) : could not find function “ggplotly”
>
Am I doing something wrong?
Regards,
RRA
Hi Rakesh, sorry for the late reply — I was on a holiday abroad. It looks like in your case the problem is escaping. One generally uses \\ instead of \ in Window paths for that reason. There might also be some libraries not loaded, e.g. %>% comes from dplyr.
Also, sadly enough, wordpress scrambles some of the special characters in the code, which leads to further problems, so you might spend a little time sorting it out (e.g. p % ggplot is wrong, it’s supposed to be %>% — but I don’t know if this is going to display okay in the comment text or the same thing would happen again!).
Generally, R code I share here is more intended to show the path, then to provide a packaged solution that would work out of the box. I simply don’t have the resources to do the latter, as I’m just a single person doing this on an unpaid basis on my leisure time, not a company with QA department etc.
Hope this helps. If you have further questions, don’t hesitate to ask!
Best regards,
Nikolai