Hi, I am trying to understand the slow responsiveness of my application, and for that I am indexing the JMX logs into Splunk.
I see the data is getting ingested every 15 minutes, and two types of collectors appear - ConcurrentMarkSweep and ParNew.
I want to know -
1. Frequency of GC
2. Time taken by GC
3. Other stuff.
The logs look like the events below, and I have no idea how to extract these metrics from them. Any help is much appreciated.
{
  colCount: 803
  colTime: 448196
  gCDuration: 0
  gCEnd: 95413744
  gCID: 803
  gCStart: 95413744
  gCThreadCount: 26
  gcName: ParNew
  jvmDescription: Prod Server 4
  mbean_domain: java.lang
  mbean_property_name: ParNew
  mbean_property_type: GarbageCollector
  memPools: [ … ]
}
{
  colCount: 3439
  colTime: 25609396
  gCDuration: null
  gCEnd: null
  gCID: null
  gCStart: null
  gCThreadCount: null
  gcName: ConcurrentMarkSweep
  jvmDescription: Prod Server 4
  mbean_domain: java.lang
  mbean_property_name: ConcurrentMarkSweep
  mbean_property_type: GarbageCollector
  memPools: [ … ]
}
hi @Shashank_87
This is going to be a long post, but please bear with me.
Firstly, a bit on your GC logs. I don't think there are 2 different GC algorithms here. The first one, ParNew, is the young (eden) generation clean-up, and the second one is CMS working on the old gen. So basically, these logs are from one single application/process on a server.
With CMS, the young generation clean-up IS a stop-the-world event. So for your questions, I suggest the following:
1 - Frequency of GC: split it into 2 - young (eden) gen clean-up and old gen clean-up.
So you use something like this: |rex field=_raw "gcName:(?<eden>.*)"
Now, if you run the search for, say, the last 1 hour, |stats count(eden) gives you the frequency of eden clean-ups over that hour. What time period you choose for your frequency calculations is best left to you. In the same way you can extract the frequency for the second set, the old gen.
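Putting that together, a sketch of one search that splits the frequency by collector (the index and sourcetype names here are assumptions - substitute your own):

```spl
index=jmx sourcetype=jmx earliest=-1h mbean_property_type="GarbageCollector"
| rex field=_raw "gcName:\s*(?<gc_name>\w+)"
| stats count by gc_name
```

This gives you one row per collector (ParNew vs ConcurrentMarkSweep) with the number of events seen in the last hour.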
2 - Time taken by GC: I can see 2 fields here, colTime & gCDuration - which one do you want to use to determine the time taken for 1 clean-up cycle? Assuming you use colTime, your search will be something like this: |rex field=_raw "colTime:(?<time>.*)"
BUT I have doubts about these values - GC times are generally in milliseconds, in which case the value for your old gen clean-up time is far too high; your application would be grinding to a halt. So instead you might want to use gCDuration. This depends on what your JVM actually prints out, and Splunk cannot tell you which field, or in what unit, your GC logs capture the GC duration.
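As a sanity check, you can summarize gCDuration per collector - again, the index name is an assumption, and treating gCDuration as milliseconds is a guess you should confirm with your JVM team:

```spl
index=jmx mbean_property_type="GarbageCollector" earliest=-24h
| rex field=_raw "gCDuration:\s*(?<gc_ms>\d+)"
| rex field=_raw "gcName:\s*(?<gc_name>\w+)"
| stats avg(gc_ms) as avg_gc_ms max(gc_ms) as max_gc_ms by gc_name
```

If avg_gc_ms comes out in the hundreds of thousands, the unit assumption is wrong and you should recheck what the MBean actually reports.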
3 - Other stuff: assuming you get this far, there is a lot of analytics you can run - for example, the predict command on one of these metrics. So if I use something like this: |rex field=_raw "gCEnd:(?<heap>.*)"
I can append a timechart to this: |rex field=_raw "gCEnd:(?<heap>.*)" |timechart span=15m avg(heap) as avg_heap |predict avg_heap AS predicted_heap algorithm=LLT upper90=high lower90=low future_timespan=10 holdback=10
This will predict your next 10 intervals (at a 15-minute span, that comes to the next 2.5 hrs). To read more on the usage of the predict command and to customize your algorithm, refer here - https://docs.splunk.com/Documentation/Splunk/7.3.1/SearchReference/Predict
I would still recommend LLT, since this is a case that is not seasonal/cyclical but has a clear trend. You can then set a threshold of, say, 3-5 GB as your total available heap and peer 2.5 hrs into the future to see if your application memory is about to be exhausted.
Remember, each time a GC clean-up happens, the predict command will take the lower heap value into consideration.
Let us know how it goes 🙂
@Sukisen1981 - Hi Sukisen, thanks for your detailed response. This is really helpful. I have some events like the ones below as well, where gCDuration and the other values come as null. What does this mean? Does it mean that GC didn't happen?
Also, how does Splunk ingest this data? Does it only ingest when a GC occurs, or is there something else going on? I can see events in Splunk at 15-minute intervals.
And how do I calculate the frequency of GC from these logs - e.g., how many times GC happens in a day?
{
  colCount: 4355
  colTime: 35604114
  gCDuration: null
  gCEnd: null
  gCID: null
  gCStart: null
  gCThreadCount: null
  gcName: ConcurrentMarkSweep
  jvmDescription: Prod Server 1
  mbean_domain: java.lang
  mbean_property_name: ConcurrentMarkSweep
  mbean_property_type: GarbageCollector
  memPools: [ … ]
}
{
  colCount: 4522
  colTime: 32193425
  gCDuration: null
  gCEnd: null
  gCID: null
  gCStart: null
  gCThreadCount: null
  gcName: ConcurrentMarkSweep
  jvmDescription: Prod Server 2
  mbean_domain: java.lang
  mbean_property_name: ConcurrentMarkSweep
  mbean_property_type: GarbageCollector
  memPools: [ … ]
}
Hi @Shashank_87
How your application logs GC depends on which tool you are using. Splunk is merely ingesting whatever you forward to it; it has no role to play here and is not transforming the data in any way. I suggest you have a quick chat with your JVM guy / GC tuning folks in your team 🙂
Why gCDuration comes out as null depends on what your verbose GC application logs print; Splunk is merely ingesting what the GC logs emit.
I can see that your new examples now contain 2 old gen clean-up events from the CMS algo, for servers 1 and 2.
The 15-minute interval you are talking about does not make sense if seen in isolation. Assuming you have a 3 GB/4 GB old gen heap, I have never seen a stable JVM that encounters an old gen clean-up every 15 minutes - that should happen once in a day or two. But if you are consuming GC events for ALL JVM processes (Prod Server 1, Prod Server 2, ... Prod Server n), what you are seeing is GC triggered by different applications.
So you use something like this to see the number of GC triggers per server in the last 24 hours - index=xxx earliest=-24h |rex field=_raw "gcName:(?<GC>.*)" |rex field=_raw "jvmDescription:(?<Server>.*)" |stats count(GC) by Server
This will give you a field count(GC) split by the prod servers for the past 24 hrs.
Regarding the time, I don't know how your GC algo logs its output. Wild guess: if I treat colTime as cumulative milliseconds AND colCount as the total number of collections, then it makes some sense - the average GC time comes to ~7 secs per collection for your second example above. That is still too high in my opinion, and I am not sure about this.
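That back-of-the-envelope division can be done in the search itself. This assumes colTime is a cumulative total in milliseconds and colCount a cumulative collection count - only your JVM team can confirm that:

```spl
index=jmx mbean_property_type="GarbageCollector"
| rex field=_raw "colTime:\s*(?<col_time_ms>\d+)"
| rex field=_raw "colCount:\s*(?<col_count>\d+)"
| rex field=_raw "gcName:\s*(?<gc_name>\w+)"
| eval avg_ms_per_gc = round(col_time_ms / col_count, 0)
| table _time gc_name col_count avg_ms_per_gc
```

For your second example (colTime 32193425, colCount 4522) that works out to roughly 7119 ms per collection.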
You want Splunk to calculate something and build some predictions - that's fine - but Splunk cannot interpret what the source data values mean. Things like why gCDuration is null, or what colTime represents, are questions you need to clarify with the JVM team who most probably enabled the GC logging in the first place.
@Sukisen1981 - Ok that makes perfect sense. I think you have a brilliant way of explaining things. Thank you.
So I am assuming the application is logging GC every 15 minutes, because that is what I can see in the Splunk events. Now, from these events, suppose I want to find out when the last GC occurred - can I use the latest event for that?
And to calculate the frequency of GC triggers, can I use the search below?
The events below are for the same application, where two GC algos are in use.
Event 1 -
{
  colCount: 439633
  colTime: 6451976
  gCDuration: 13
  gCEnd: 293498632
  gCID: 439633
  gCStart: 293498619
  gCThreadCount: 13
  gcName: ParNew
  jvmDescription: Prod Server 1
  mbean_domain: java.lang
  mbean_property_name: ParNew
  mbean_property_type: GarbageCollector
  memPools: [ … ]
}
Event 2 -
{
  colCount: 569
  colTime: 545439
  gCDuration: 7878
  gCEnd: 293043228
  gCID: 569
  gCStart: 293035350
  gCThreadCount: 13
  gcName: ConcurrentMarkSweep
  jvmDescription: Prod Server 2
  mbean_domain: java.lang
  mbean_property_name: ConcurrentMarkSweep
  mbean_property_type: GarbageCollector
  memPools: [ … ]
}
Can i just use this -
index=jmx sourcetype=jmx host="Prod Server*" jvmDescription="*" mbean_domain="java.lang" mbean_property_type="GarbageCollector"
| eval jvmgc=jvmDescription+"_"+gcName
| eval TypeOfGC = if(gcName="ParNew", "Young Gen Cleanup", "Old Gen Cleanup")
| chart count over host by TypeOfGC
Will this give me a correct count of how many GCs of a particular algo type were triggered?
hi @Shashank_87
your code should work, please try it out and confirm
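On your other question - when the last GC occurred - note that if events arrive on a fixed schedule, the latest event only tells you when the data was last polled, not when a GC last ran. A change in colCount between consecutive events is a better signal that a collection actually happened. A sketch (index/sourcetype names assumed; streamstats compares each event with the previous one per server/collector):

```spl
index=jmx sourcetype=jmx mbean_property_type="GarbageCollector"
| sort 0 _time
| streamstats current=f last(colCount) as prev_count by jvmDescription, gcName
| where colCount > prev_count
| stats latest(_time) as last_gc by jvmDescription, gcName
| eval last_gc = strftime(last_gc, "%F %T")
```

This relies on colCount being a cumulative counter, which is how the GarbageCollector MBean normally behaves - but confirm it against your data.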
Hi @Sukisen1981 - I tried the above code and this is the response I get, which is where I get confused. Does this mean that GC occurred every 15 minutes?
The count is 2 because of the 2 servers. I am also speaking to the person who enabled this logging to confirm the time interval, and will let you know what he says.
_time Old Gen Cleanup Young Gen Cleanup
2019-09-03 00:00:00 2 2
2019-09-03 00:15:00 2 2
2019-09-03 00:30:00 2 2
2019-09-03 00:45:00 2 2
2019-09-03 01:00:00 2 2
2019-09-03 01:15:00 2 2
2019-09-03 01:30:00 2 2
2019-09-03 01:45:00 2 2
2019-09-03 02:00:00 2 2
2019-09-03 02:15:00 2 2
hi @Shashank_87
Let us know what you find out. Although the data makes it look like an old gen GC happened once every 15 mins, that cannot be true - at least, I have not come across situations where the old gen gets cleaned up so frequently.
This might just be normal concurrent CMS activity - remember, CMS is an algo that marks objects for clean-up concurrently with the running application.
If that is the case, your use case changes: if you treat gCDuration as milliseconds, the values don't amount to much, and these events could simply be the CMS running in concurrence with the app rather than full old gen collections.
Hi @Sukisen1981, yes that's correct. I spoke to my Splunk admin about the source of this data, and he said there is a script on the Splunk side that pulls the logs from the JMX server every 15 minutes - so there is no event-driven source as such.
I am investigating this more, but yes, it is very unlikely that CMS is happening every 15 minutes.
Thank you so much for all the help you have provided. At least I've got off to a start. 🙂
I will let you know once I find something. I am talking to the application team as well.