The plot thickens. Those of you who read my previous post on this issue will be gratified to know that a second customer with a "hanging server" problem found relief by troubleshooting the network. The symptoms were slightly different: on this server, JRUN would peg at 90% to 95%, but the server would stay "up". This caused a "paucity of processor capacity", so other things could bring the server down. We did not suspect networking because no log file errors indicated networking problems - there were no socket errors and no TCP/IP errors in the Windows event log. Whenever we tested the database we found connectivity was "up" and capacity utilization was low. After my experience with the previous customer, whose problem was solved by resetting the link speed (see my previous post for more detail), we decided to take a closer look at networking.
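One reason this kind of problem hides is that a simple "is the database up?" check only tells you the connection succeeded, not how long it took. A minimal sketch of a slightly better probe - the host and port below are placeholders, and this is illustrative, not what we ran on the customer's box - measures TCP connect latency, so an intermittent stall at the switch shows up as a spike even when no error is ever logged:

```python
import socket
import time

def probe_port(host, port, timeout=3.0):
    """Measure TCP connect latency to host:port.

    Returns the connect time in seconds, or None if the connection
    failed or timed out. A latency spike can reveal an intermittent
    network stall even when the connection ultimately succeeds and
    nothing appears in the logs.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None

if __name__ == "__main__":
    # Hypothetical DB server address - substitute your own.
    latency = probe_port("db-server.internal", 1433)
    print("connect latency:", latency)
```

Run in a loop (say, once a minute) and logged, this gives you a baseline to compare against when the web server starts misbehaving.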
What we found was an unmanaged switch sitting between the DB server and the Web server. It's impossible to tell what was "really" going on because the switch was unmanaged, but I suspect a queue buffer was being overrun, or perhaps the ports were re-negotiating periodically. Rebooting this switch resulted in an immediate "fix" of our problem - JRUN processor usage immediately dropped to acceptable levels. I'm wagering that JRUN was maintaining unused sockets through the switch that were "killed off" when the switch was rebooted.
As in the previous case, there was nothing really "wrong" with the configuration. An unmanaged switch running 100 Mb full duplex with auto-negotiating ports should not create any particular problems unless all the network capacity is used up (it was not). Because there were no overt issues on the server, there were no real clues in the various log files and stack traces to point to this problem - nothing we could "hang our hat on".
The only real common denominator was that in both cases the problem apparently had to do with port settings on the switch. In both cases a busy web server was connecting via JDBC/TCP to a database server over an internal network - in one case through a decent "managed" switch, in the other through an "unmanaged" switch. I suspect duplexing or handshaking protocols, but I haven't had time to try to duplicate the problem on a test network - and I don't even know whether it is possible to duplicate it. My next step is to look for a network counter or utility that will help me examine the sockets and their activity. Part of my triage from this point forward will be to examine switching, link speeds and intermediate directors.
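For examining sockets and their activity, `netstat -an` on the Windows servers in question will dump every connection and its state; a pile of lingering connections between the web and DB servers would support the "unused sockets" theory. As a sketch of the same idea in script form - this one reads Linux's `/proc/net/tcp`, so it's an assumption that you have a Unix-like box to run it on, not something that applies to the Windows servers directly - you can count sockets by TCP state:

```python
from collections import Counter

# TCP state codes as they appear (in hex) in /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def socket_state_counts(path="/proc/net/tcp"):
    """Return a dict mapping TCP state names to socket counts."""
    counts = Counter()
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            state = TCP_STATES.get(fields[3], fields[3])
            counts[state] += 1
    return dict(counts)

if __name__ == "__main__":
    for state, n in sorted(socket_state_counts().items()):
        print(f"{state:12s} {n}")
```

A large and growing ESTABLISHED or CLOSE_WAIT count between two specific hosts, with no matching application activity, is the kind of smoking gun the log files never gave us.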