Hi,
In this particular case there are 2 production servers running the same application, inside a JBoss EAP 6.1.1 container (Oracle/Sun JDK 1.7_25).
Both servers seemed to be doing excessive garbage collection, so we added the '-XX:+DisableExplicitGC' JVM parameter on one of them.
Can we somehow find out where these GC invocations are coming from? The only difference between the two servers is the 'DisableExplicitGC' parameter; without it, both JVMs show the same behaviour.
I included pictures for reference.
Bad:
Good (explicit GC disabled):
Regards,
Steven
Answer by Balazs B. ·
Hi!
If you would like to see which methods initiate the JVM garbage collection, my suggestion is to put a method sensor on the System.gc() and Runtime.gc() methods. You will then be able to see all their occurrences in the Methods dashlet, and it will allow you to drill down to the PurePaths.
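If placing a sensor is not an option, a rough plain-JMX alternative (just a sketch; the class name GcCauseLogger is only an example) is to listen for the GC notifications that Oracle JDK 7u4+ exposes. The reported cause string is "System.gc()" for explicit calls, while causes such as "Allocation Failure" or "CMS Initial Mark" are JVM-driven:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;
import com.sun.management.GarbageCollectionNotificationInfo;

public class GcCauseLogger {
    // Registers a listener on every garbage collector MXBean and logs each
    // collection together with its cause.
    public static void install() {
        for (GarbageCollectorMXBean gcBean : ManagementFactory.getGarbageCollectorMXBeans()) {
            NotificationEmitter emitter = (NotificationEmitter) gcBean;
            emitter.addNotificationListener(new NotificationListener() {
                public void handleNotification(Notification n, Object handback) {
                    if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION.equals(n.getType())) {
                        return;
                    }
                    GarbageCollectionNotificationInfo info =
                            GarbageCollectionNotificationInfo.from((CompositeData) n.getUserData());
                    // Cause "System.gc()" = explicit call; other causes are collector-driven.
                    System.out.println(info.getGcName() + " (" + info.getGcAction() + ") cause="
                            + info.getGcCause() + " duration=" + info.getGcInfo().getDuration() + " ms");
                }
            }, null, null);
        }
    }
}

Calling GcCauseLogger.install() once at application startup would then log every collection with its cause.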
However, I don't know a lot about your application; you could also try configuring the RMI garbage collection interval with these JVM parameters. Maybe it will help:
-Dsun.rmi.dgc.server.gcInterval=3600000 -Dsun.rmi.dgc.client.gcInterval=3600000
Please inform us about the results! Thanks.
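To double-check which interval is actually in effect, a quick sketch (purely illustrative; the class name is just an example) is to print the system properties that the RMI distributed GC reads at startup:

public class RmiDgcIntervalCheck {
    public static void main(String[] args) {
        // Prints what the JVM was started with; null means the RMI built-in default applies.
        System.out.println("sun.rmi.dgc.server.gcInterval = "
                + System.getProperty("sun.rmi.dgc.server.gcInterval"));
        System.out.println("sun.rmi.dgc.client.gcInterval = "
                + System.getProperty("sun.rmi.dgc.client.gcInterval"));
    }
}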
Hi,
Sorry for not coming back sooner with the definite cause: it was the permgen space that was nearly full (over 90%), as I suspected in my last comment, causing the CMS collector to be triggered constantly. So no one was calling System.gc() (or Runtime.gc()); it was rather a badly configured JVM. Since increasing the permgen a little, the issue hasn't occurred again.
FYI: the RMI interval is already set to 'MAX_INT'; as far as I know, that would cause a full GC instead of a CMS cycle, at least in our configuration.
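For anyone hitting the same thing, a small sketch (names are just examples) of how the perm gen fill level can be checked from inside the JVM via the standard memory pool MXBeans, without waiting for JConsole:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class PermGenUsage {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // The pool is called "CMS Perm Gen" with the CMS collector and
            // "PS Perm Gen" with the parallel collector.
            if (pool.getName().contains("Perm Gen")) {
                MemoryUsage u = pool.getUsage();
                long pct = u.getMax() > 0 ? u.getUsed() * 100 / u.getMax() : -1;
                System.out.println(pool.getName() + ": " + (u.getUsed() >> 20) + " MB used of "
                        + (u.getMax() >> 20) + " MB max (" + pct + "%)");
            }
        }
    }
}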
Regards,
Steven
Answer by Steven P. ·
Hi,
I now see this in another application; however, this one is nearly idle as far as user requests are concerned. With such a low load, I see a pattern emerging.
Once a GC occurs on the young generation, the time needed on the old generation drops as well.
On this JVM, explicit GC is still allowed, but a CPU sampling doesn't reveal any suspects.
CMSClassUnloading is also enabled.
Has anyone seen this behaviour before? How can this be explained? Looking at this, I don't think someone is calling System.gc, since that would also trigger a collection of the young generation if I'm not mistaken.
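To watch that pattern outside of Dynatrace, here is a simple sketch (illustrative only; the class name is just an example) that polls the collector MXBeans. With CMS the beans are typically "ParNew" for the young generation and "ConcurrentMarkSweep" for the old one:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCountWatcher {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            StringBuilder line = new StringBuilder();
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                // Cumulative counts and times per collector since JVM start.
                line.append(gc.getName())
                    .append(" count=").append(gc.getCollectionCount())
                    .append(" time=").append(gc.getCollectionTime()).append("ms  ");
            }
            System.out.println(line);
            Thread.sleep(10000);
        }
    }
}

Note that this polls the JVM it runs in; to watch the application JVM it would have to run inside it or connect via remote JMX.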
Regards,
Steven
Replying to myself...
I suspect that the perm gen space is not big enough (or there is a leak). This would mean the CMS constantly gets invoked to clear out some space in the perm gen, but fails to do so because everything is still in use, so it immediately triggers again.
Looking at the graphs for perm gen (a custom chart in DT, or using JConsole), I can see that it is nearly full. At this moment, of those 4 JVMs, 1 has this issue, and it is also the one with the highest amount of used perm gen space.
I have now increased the perm gen space for these 2 applications.
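As a belt-and-braces check after raising the perm gen, one could also get notified when the pool approaches its maximum again. A rough sketch (class name is just an example) using the standard memory threshold notifications:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;

public class PermGenThresholdAlert {
    public static void install() {
        // Ask the perm gen pool to notify once usage crosses 90% of its maximum.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Perm Gen") && pool.isUsageThresholdSupported()
                    && pool.getUsage().getMax() > 0) {
                pool.setUsageThreshold((long) (pool.getUsage().getMax() * 0.9));
            }
        }
        NotificationEmitter emitter = (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener(new NotificationListener() {
            public void handleNotification(Notification n, Object handback) {
                if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED.equals(n.getType())) {
                    MemoryNotificationInfo info =
                            MemoryNotificationInfo.from((CompositeData) n.getUserData());
                    System.out.println("Pool over threshold: " + info.getPoolName()
                            + ", used=" + info.getUsage().getUsed());
                }
            }
        }, null, null);
    }
}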
Answer by Steven P. ·
The CPU sampling doesn't bring up any likely suspect; the methods with a large execution time are 'wait(long)', socketRead0(), etc.
Sorting on 'CPU time' gets me the 'park(boolean,long)' method, as well as socketRead0, and things like that.
For now the application doesn't have that behaviour any more, which is strange in itself (the samples were taken when the issue was still present).
Should it reappear, any other suggestions?
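One more thing that could be tried if it reappears (a plain-Java sketch, not a Dynatrace feature; the class name is just an example): compare per-thread CPU time via ThreadMXBean, which separates threads that are genuinely burning CPU from ones just sitting in wait(), park() or socketRead0():

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class BusyThreadFinder {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isThreadCpuTimeSupported()) {
            System.out.println("Per-thread CPU time is not supported on this JVM");
            return;
        }
        for (long id : threads.getAllThreadIds()) {
            ThreadInfo info = threads.getThreadInfo(id);
            long cpuNanos = threads.getThreadCpuTime(id);
            // Threads parked in wait()/park()/socketRead0() accumulate elapsed
            // time but very little CPU time, so they drop out here.
            if (info != null && cpuNanos > 0) {
                System.out.println(info.getThreadName() + ": " + (cpuNanos / 1000000L) + " ms CPU");
            }
        }
    }
}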
Not sure... from a strict Java performance engineering perspective, it's usually better to let the JVM manage the memory. Without knowing anything about your app it's hard to say, but perhaps the DisableExplicitGC setting is appropriate for normal production runtime? andreas.grabner@dynatrace.com, the man who literally wrote the book, may have more insight on that...
Rick B
I've been through that book, really useful.
It would surprise me if it were 'normal' to activate that option in production; I think it was added for admins like me who have to deal with unknown, possibly badly written, code. I do have access to the git repository, though; a search through both the code and its history came up empty, except for a single test class (JUnit).
It's possible then that it's a lower-level method executing within a library. Back to a question Andi posed earlier: During the period when the issue was happening, do you see a lot of contribution from suspension (orange) in the PurePath or the Response Time Hotspots view? Can you attribute it to one method or one API?
Rick B
I did see a lot of suspension in the PurePaths and in the Response Time Hotspots, but most of it was in the 'Servlet' API. That's the part every request comes through, and it accounts for the largest part of the response time anyway, about 80-90%. If I'm not mistaken, suspension time from one place also shows up in other places, so it is a bit harder to pinpoint on a busy server.
Please do correct me if I'm wrong on that assumption.
The data itself has already been cleared out of Dynatrace, except for the CPU samplings. So I think I'll let this rest until it pops up again.
Thanks for your input.
Answer by Andreas G. ·
I like Richard's idea of the CPU sampling. The methods that show up as having high exec time are probably the ones calling it.
On the other hand, I would also look at those PurePaths that run long in the timeframe when the GC happens. Use the Methods Hotspot and see which methods take a long time. You might be lucky and find the problematic method as well. It could, however, be that gc() is called from a different thread that is not captured by a PurePath; in that case the CPU sampling is the better approach.
Answer by Steven P. ·
It seems the images are not properly shown, so I'll briefly describe them.
Both images show the old generation heap of the JVM. The 'Bad' one shows the used memory as an almost straight horizontal line, and the GC invocations (grey rectangles) have no space in between them. In the 'Good' image, the used memory line has a clear upward trend until it reaches approximately 90%; then there is one grey bar for a GC, and it drops back down to the same level as the used memory in the other image. Both images were taken on a 30-minute scale.
Here's how you can accomplish this: run a CPU sample at the highest sampling rate for a fixed period of time. If you are constantly seeing GC and expect it to happen at least a couple of times in the course of a minute, run it for 60 seconds; otherwise run it for longer. My recommendation, though, is to configure it to run only for a specified period of time. I've seen cases where people have accidentally left this running in prod for days because they forgot it was running. Doc: Runtime Specific Dashlets and CPU Sampling
Typically the other approach would be to temporarily place a sensor on the method, but since this is a native method we can't do that here.
Hope that helps,
Rick B