ColdFusion Muse

Jrun Processor Pegging Issues and Solutions

Lately I've been involved in a couple of troubleshooting sessions where JRUN on a CFMX server was causing 95% to 100% processor utilization. Unfortunately I have not yet stumbled onto a magic bullet for this. Tweaking memory settings, changing garbage collection routines, modifying the threads for the scheduler and the simultaneous requests all seem to help, and in some cases solve the problem. I have never found one single solution that solves this problem. It usually comes down to either JVM arguments or an external process (a database, queue, COM, FTP etc) that is causing a hanging request.

Today, however, I stumbled upon a solution that seemed to solve the problem immediately. If your processor spike is due to this specific issue then this seems to fix it. Keep in mind, that I'm basing that opinion on the fact that taking the following steps seems to have fixed a production server in my care - so take it for what it's worth.

When Log Files Attack

I got a call today from an important customer. This customer was running pretty well on a complicated system for over a year when suddenly, JRUN started running at 90 to 100 percent. After poking around I noticed that the default "out" log (/runtime/logs/*instance*-out.log) was quite large - over 500 megs. I surmised that Coldfusion might struggle a little bit trying to append to such a large file. We stopped the server, cleared the directory and restarted the server. Sure enough, Jrun went back to behaving. I'm quite sure this is not the fix for every situation, but it's definitely something to check - especially if the spike happens on a well-behaved server with no precipitous changes.

For more on log files and where to find them seem my post on Why is my Jrun Err log so big and my other post on Finding all those sneaky log files.

NOTE: There is a "day after" follow up to this post that is titled Fixing Locked Jrun Log files Without Restarting Coldfusion.

Comments
Steven Erat's Gravatar Having restarted ColdFusion in order to release the lock on the log file for deletion, you introduced an artifact to the experiment. It may well have been that restarting would have resolved it regardless of the log size.

Should you have the leisure to "experiment" again, you could copy and then overwrite the whole log file with a CFFILE statement block (assuming the request would run, perhaps using the jws thread pool if the jrpp pool is hung), which would effectively reinitialize the log without restarting. (Delete is not possible for Windows users since CF maintains a lock on it, but perhaps CF can CFFILE/copy/delete it).

Conversely, if you find the problem resolved by stopping and deleting the log, perhaps you could then reintroduce the log to see if the high CPU behavior comes back.

Someone else blogged today about large (8GB) server-out.log. Note that there is a JRun bug for only rotating the server-event.log but not the out or err logs. That's targeted for an upcoming release.
# Posted By Steven Erat | 5/2/06 7:18 PM
mkruger's Gravatar Steven,

Normally I would agree with you because agreeing with you has always made me look good (ha). But in this case we had already restarted CF several times with no change to processor utilization. The procs would spike as soon as CF reached a full head of steam. Only after removing the log file did the procs come down.

As for an overwrite with CFFILE - I suspect it wouldn't work... but I will experiment and see. It has the disadvantage of kill the log file - which sometimes contains useful information. But it would be a runtime solution and keep us from a server restart.
# Posted By mkruger | 5/3/06 8:11 AM
Steven Erat's Gravatar I didn't realize that for that specific server restarts alone had already been tried without success.

As far as zero'ing out the log with a cffile write, what I means was to do a copy of it first, i.e. default-out.log.1, then zero out default-out.log. On Unix/Linux this test would be even easier root could do this without dealing with file locks.

I'll see if I can generate a multi-GB out.log for testing. Of course, there may be other uknowns involved, but I could at least see if a large log alone is sufficient.
# Posted By Steven Erat | 5/3/06 9:41 AM
Steven Erat's Gravatar No luck. I used a new installation of ColdFusion MX7 server on RHEL4 on a 400MHz Intel with just 392M RAM because that machine was already slow enough to not obscure any potential differences in testing. It took 112 seconds to start up (yes, this is normal for that machine). Loading the CFAdmin for the first time took about 5-8 seconds, and then clicking on the first few menu links took about 1 second to load each. Clicking on those links again (after initial compile and loading) still took about 1 second (not a machine I use for its performance benefits :).

ColdFusion was stopped and started, and links were clicked again, just to repeat the "normal" condition.

Then I created a coldfusion-out.log by cat'ing other log info into it until it was 2GB. Starting ColdFusion took 117 seconds, so not much difference there. Loading the CFAdmin was equally slow at about 5 seconds for the login to complete. Then clicking the links again took about 1 second each. In sum, it appeared to be no difference. I even ran the code that writes JDBC driver versions to the out.log just to make sure ColdFusion had an open file handle on it. The version info showed up in the tail, but still no unusual behavior.

In the normal and test cases I monitored CPU and Mem with top, but there were no important differences. In the test case, when I stopped clicking around the CPU settled down to 0-5%.

Like many tests, problems are sometimes platform dependent, environment factor dependent, or particular to some combination of local factors for that server. So on your "test" server nulling the out log may have had an impact, but there may have been some other unique aspect to the server that exacerbated the problem. My simple test shows that at least as a baseline, a large out log is not sufficient by itself to cause high resource utilization.
# Posted By Steven Erat | 5/3/06 10:41 AM
Jim O'Keefe's Gravatar Mark, do you have rules set up in the cfadmin to limit log file size? My boss used this post to fix a cf8 production server over the weekend. He said the file size was huge, but cf is supposed to be archiving the log files once they reach a size threshold and it was just archived recently. It seems like there is some way that the archiving fails and the file continues to grow.
# Posted By Jim O'Keefe | 11/30/09 1:05 PM



Blog provided and hosted by CF Webtools. Blog Sofware by Ray Camden.