Quick Tip on Identifying Semaphore Lock Problems
Semaphore locks are just bad news all around. They show up to tell you there are problems, but they don't exactly finger the villain. To make matters worse, it can be tough to ferret out the values that are provided. Bad news, and not enough of it.
If I'm using my statrep reports, I'll see the "sem.timeouts" value that lets me know, yes, I have timeouts. Only, the statistic collector pulls in a single number that represents the aggregation of all the timeouts. So, statrep doesn't give me the same detail as the actual statistic.
If I run "sh stat sem.timeouts" on my console, I'll get something like this:
Sem.Timeouts = 030B:202 039E:143 0294:96 03CE:52 0A0B:37 0931:32 4245:29 410F:24 03A1:21 03AB:8 5708:8 4117:7 29CA:4 0428:4 43E5:4 013A:3 039C:2 0256:2 03B3:2 0419:1 03A4:1 0111:1 039
Only the last value (39) is recorded in statrep, and what I need is the entire list. The way semaphore locks are written, the most common category is listed first with its incident count. So, on this server, there have been 202 occasions of a semaphore lock over the NIF collection. How do I know that "030B" matches up to the NIF? By setting up a debug variable and reading the semdebug.txt file. What I end up with is a table of semaphore events and their counts.
| SEMAPHORE CODE | INCIDENT COUNT | SEMAPHORE LABEL |
|----------------|----------------|-----------------|
| “030B” | 202 | NIF collection semaphore |
| “039E” | 130 | Logger I/O task semaphore |
| “0294” | 91 | Directory manager queue semaphore |
| “03CE” | 52 | Monitor manager: Overall control semaphore |
| “0A0B” | 29 | Session table semaphore |
| “4245” | 27 | open database queue semaphore |
| “410F” | 24 | ?? |
| “03A1” | 20 | Logger append semaphore |
| “0931” | 17 | Task sync semaphore |
| “03AB” | 8 | NSF Pool semaphore |
| “5708” | 7 | BSAFE semaphore |
| “4117” | 5 | ?? |
| “29CA” | 4 | Platform Statistics Data Collection semaphore |
| “43E5” | 4 | Lock Manager: Master lock hash table semaphore |
| “0428” | 3 | ?? |
| “039C” | 2 | Logger DBCB control block semaphore |
| “013A” | 2 | ?? |
| “0256” | 2 | NSF buffer pool container |
| “03B3” | 2 | ?? |
| “0419” | 1 | ?? |
| “03A4” | 1 | Logger buffer semaphore |
| “0111” | 1 | ?? |
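Building that table by hand gets old quickly, so here is a minimal Python sketch of the same idea: split the console line into code:count pairs and look up each code's label. The regular expression assumes every entry is four hex digits, a colon, and a decimal count, which matches the sample line in this post but is not a documented format. The LABELS dictionary holds only a few rows copied from the table above, and the function name is my own.

```python
import re

# Shortened copy of the console line shown earlier in this post.
SAMPLE = "Sem.Timeouts = 030B:202 039E:143 0294:96 03CE:52 0111:1 039"

# A few labels copied from the table above; extend as semdebug.txt reveals more.
LABELS = {
    "030B": "NIF collection semaphore",
    "039E": "Logger I/O task semaphore",
    "0294": "Directory manager queue semaphore",
    "03CE": "Monitor manager: Overall control semaphore",
}

# Assumed entry shape: four hex digits, a colon, a decimal count.
PAIR = re.compile(r"\b([0-9A-Fa-f]{4}):(\d+)\b")


def parse_sem_timeouts(line):
    """Return (code, count) pairs in the order the console printed them.

    Tokens without a colon (such as the trailing value) are skipped.
    """
    return [(code.upper(), int(count)) for code, count in PAIR.findall(line)]


for code, count in parse_sem_timeouts(SAMPLE):
    # Codes with no known label print as "??", as in the table.
    print(f"{code}  {count:>5}  {LABELS.get(code, '??')}")
```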
The question is, just how can I easily extract these values? Statrep doesn't hold them and I don't want to have to manually enter console commands to track the rise of specific semaphore categories. Are "open database queue" semaphores increasing faster than "NSF buffer pool container" semaphores?
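Answering that kind of question comes down to subtracting one snapshot from another. The Python sketch below does that for two captured lines. The counts for 4245 and 0256 come from the table and the console line in this post; treating the table as the earlier snapshot is my assumption, as are the function names and the code:count pattern.

```python
import re

# Assumed entry shape: four hex digits, a colon, a decimal count.
PAIR = re.compile(r"\b([0-9A-Fa-f]{4}):(\d+)\b")


def parse_sem_timeouts(line):
    """Map each semaphore code in a console line to its timeout count."""
    return {code.upper(): int(count) for code, count in PAIR.findall(line)}


def growth(earlier, later):
    """Return (code, increase) pairs, largest increase first.

    A code missing from the earlier snapshot counts as starting from zero.
    A negative increase suggests the counters were reset by a restart.
    """
    deltas = {code: count - earlier.get(code, 0) for code, count in later.items()}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# 4245 = open database queue semaphore, 0256 = NSF buffer pool container.
before = parse_sem_timeouts("Sem.Timeouts = 4245:27 0256:2")
after = parse_sem_timeouts("Sem.Timeouts = 4245:29 0256:2")

for code, delta in growth(before, after):
    print(code, delta)
```

With regular captures, the same subtraction applied to consecutive snapshots shows which categories are climbing and which are flat.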
Yes, there is a way to do it, but it turns out the syntax is peculiar. You can actually capture console text to a redirected text file. Try it.
sh stat sem.timeouts > mystats.txt
Oops. My mistake. That syntax won't actually work. Sure, it looks good, but if you want to create a text file you'll need to take out a space. The correct syntax is "sh stat sem.timeouts >mystats.txt"
Now, it's possible to create a scheduled program document to run this query. I'll either write an agent to pick up the values into Notes or, because the server is running AIX, grep/awk the file into an HTML page.
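For the HTML route, here is a rough sketch in Python rather than grep/awk, purely as an illustration of the idea. It assumes the redirected file contains the Sem.Timeouts line in the same code:count form as the console output above. The output file name semaphores.html, the location of mystats.txt, and the function names are all assumptions on my part.

```python
import html
import re
from pathlib import Path

# Assumed entry shape: four hex digits, a colon, a decimal count.
PAIR = re.compile(r"\b([0-9A-Fa-f]{4}):(\d+)\b")


def stats_to_html(text):
    """Build a small HTML table from the text of a redirected console capture."""
    rows = [
        f"<tr><td>{html.escape(code.upper())}</td><td>{int(count)}</td></tr>"
        for code, count in PAIR.findall(text)
    ]
    return (
        "<table>\n<tr><th>Semaphore code</th><th>Timeouts</th></tr>\n"
        + "\n".join(rows)
        + "\n</table>\n"
    )


def convert(source, target):
    """Read the capture file and write the HTML table next to it."""
    # errors="replace" because the capture file's encoding is not known here.
    text = Path(source).read_text(errors="replace")
    Path(target).write_text(stats_to_html(text))


# mystats.txt is the file name used above; where the server writes it is
# not covered in this post, so the relative path here is an assumption.
if Path("mystats.txt").exists():
    convert("mystats.txt", "semaphores.html")
```

Run on a schedule after the console capture, this would regenerate the page each time the file is rewritten.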
Now I have the fun part of figuring out exactly what is causing my semaphore locks.


Comments
I agree absolutely that RTF is much harder to process. It was just that it would allow an easier building of history, as you could keep everything in an NSF store over time.
The thing I'd missed was that you're on AIX. If you're always going to be on AIX, then you may as well use the cat command with your text file, to build much the same history. It'll work just fine.
That screenshot looks like it's a pretty useful tool - I look forward to hearing more about it! Sadly, I won't be at Lotusphere - but I hope you put your slides up afterwards!
Posted by Philip Storry At 04:38:17 AM On 12/20/2006
Yes, it does give the complete statistic for sem.timeouts, but it puts it into a single RTF field (body). I don't find the mail-in stats report nearly as easy to parse as the statrep documents. Certainly, the mail-in stats document isn't going to display in a view without some agent crunching. However, the mail-in does do something nicer than the console redirect does: it allows a history to be easily built. I can't do ">>" instead of ">" with the console redirector to append new statistics.
For the most part, I'm happiest with the statrep report--there are just a few things that don't come over.
In my shop, not everyone in IT gets the Domino Admin tool, so I've created an application that parses out the statistics I want from statrep and marks those values which exceed a threshold ({ Link } -> flickr screenshot). I'll show it at LS'07 for monitoring systems that get upgraded. Will you be making it to Orlando this year?
Thanks for your comment.
Posted by Jack Dausman At 03:44:46 AM On 12/16/2006
Out of sheer curiosity, I just checked - the Stats task will honour a "please reply to" setting, and send the mail there instead. So you could have a mailbox that runs an agent on schedule, or even on new mail delivery, and processes the messages there.
At first glance, this seems worse because you've got to get the results from a rich text field rather than a text file. But actually, it's easier because:
1. Accessing a document in a database is far more portable than text file processing - if servers have different installed paths, or are on different platforms, your agent becomes far more difficult to maintain.
2. You can just stamp each received mail as processed, and append fields with the values you're interested in. Then build views which show the lock values over time. As the stats are aggregated since last startup anyway, this makes sense and saves you work.
Just some ideas...
Posted by Philip Storry At 06:09:55 PM On 12/15/2006