We are trying to forward verbose Java garbage collection log files (java version "1.6.0_34") into Splunk. We use Java's -XX:+PrintGCDetails option to get detailed information.
There is a bug in Java where the last line of the log does not get flushed until the next garbage collection occurs. So a log line that should be one event in Splunk gets split into two:
The top event is the remainder of the line from the bottom event, and it arrives an hour later because that's when the next GC happened (there are newer events not shown in the screenshot).
The gc log file problem is described really well here.
Has anyone else encountered this or similar issues and found a good solution or work around?
I have tried to craft custom LINE_BREAKER property logic, to no avail (a sketch of the kind of stanza I tried is below). I have also tried playing around with the time_before_close property with no luck. I'm not sure time_before_close is even a feasible solution, since some of our Java apps have long periods of time between garbage collections.
Using Splunk 5.0.1.
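For illustration, the kind of props.conf stanza I experimented with looked roughly like this (the gc_log sourcetype name and the break pattern are just examples, and none of the variations solved the split-event problem):

[gc_log]
# break events on the elapsed-seconds prefix of a GC line, e.g. "1234.567: [GC ..."
LINE_BREAKER = ([\r\n]+)\d+\.\d+:\s+\[
SHOULD_LINEMERGE = false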
What I did was to reassemble the two halves of the GC event using transactions.
Here's an example:
sourcetype=PIA_stdout ("GC" OR "secs")
| transaction endswith=secs keeporphans=true
| search date_hour>7 date_hour<18
| rex field=source "/webserv/(?<environment>[^/]+)/servers/"
| rex field=_raw "(?<starting_occupancy>[0-9]+)K->(?<ending_occupancy>[0-9]+)K\((?<heap_size>[0-9]+)K\), (?<pause_time>[0-9.]+) secs\]"
| stats max(pause_time) AS "Max Pause" BY environment
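For context, the rex above pulls its fields from the tail of a typical GC log line, which looks roughly like this (values are illustrative):

[GC 325407K->83000K(776768K), 0.2300771 secs]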
In this case, I didn't care about "after hours" GCs, but I couldn't filter on date_hour in the first search because the second half of the events (the part with "secs]" at the end) didn't get date_hour fields. I had to assemble the two halves into a transaction before I could filter on date_hour.
Also, when the GC event took less than approximately 3 seconds, it didn't get split. So I needed keeporphans=true to keep the GC events that weren't mangled.
The time_before_close setting in props.conf is (short of reformatting the file entirely, as tpederson suggests) really your best/only option. This has worked for many people before, but you will need to set time_before_close to a value greater than your longest gap between GCs. (Which I know could be minutes in the worst case; this really sucks.)
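A minimal sketch of what that looks like (the source pattern and the value are placeholders to adapt):

# props.conf
[source::.../gc*.log]
# seconds to keep the file open after EOF; must exceed your longest gap between GCs
time_before_close = 7200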
I have a solution to this that may not be pretty, but it seems to be working for us. It looks like your JVM outputs the timestamps, so our solution is a little different, but it can work for both with some modification.
Our JVM wouldn't output timestamps for love nor money, and Splunk didn't like the fact that Java would output half a line at the start of GC and then output the rest later when it felt like it.
So I found a python script that takes the modified date and calculates a timestamp from the elapsed seconds in the log file. I paired that with a bash script that goes through our JVMs and lumps all the files together, ignoring any unfinished lines that don't end with a line break.
The original python script is at the bottom of this page.
Here's my modified version. We write the date into the file name every time the JVM starts, so we don't have to rely on the file's modified date. My version therefore uses the datestamp string from the end of the file name.
#!/usr/bin/env python
import sys, datetime, time

# true if string is a positive float
def validSeconds(str_sec):
    try:
        return 0 < float(str_sec)
    except ValueError:
        return False

# make sure you can get a datetime from the filename
def getDateTime(str_datetime):
    try:
        return datetime.datetime(*(time.strptime(str_datetime, "%Y%m%d%H%M%S")[0:6]))
    except ValueError:
        sys.exit(1)

# show usage
if len(sys.argv) < 2:
    print "Usage: %s <gc.log>" % (sys.argv[0])
    sys.exit(1)

file_str = sys.argv[1]
# the last 14 characters of the file name are a YYYYMMDDHHMMSS datestamp
lastmod_date = getDateTime(file_str[-14:])

file = open(file_str, 'r')
lines = file.readlines()
file.close()

# get last elapsed time (the seconds prefix of the last GC line)
for line in reversed(lines):
    parts = line.split(':')
    if validSeconds(parts[0]):
        break

# calculate start time
start_date = lastmod_date - datetime.timedelta(seconds=float(parts[0]))

# print file, prepending a human-readable time where appropriate;
# stop at any unfinished line that has no trailing newline
for line in lines:
    if "\n" not in line:
        break
    parts = line.split(':')
    if not validSeconds(parts[0]):
        print line.rstrip()
        continue
    line_date = start_date + datetime.timedelta(seconds=float(parts[0]))
    print "%s: %s" % (line_date.isoformat(), line.rstrip())
Now, I paired this with a bash script that goes into each JVM dir, finds all the java.log files, and feeds them to the python script. The output from the python script is collected into a tmp file, which then overwrites the previous amalgamated file. I'm not 100% sure, but I think this is the best way to avoid duplicate entries.
#!/bin/bash
for instance in $(find /usr/JVMs/ -maxdepth 1 -type d -name "jboss.*" 2> /dev/null); do
    for file in "${instance}"/server/ourserver/log/java.log.*; do
        # skip if the glob matched nothing
        [ -e "${file}" ] || continue
        # only convert files that contain at least one complete line
        LINECOUNT=$(wc -l < "${file}")
        if [ "${LINECOUNT}" -gt 0 ]; then
            /usr/local/bin/convertTimeStamp.py "${file}" >> "${instance}/server/ourserver/log/splunkGcc.tmp"
        fi
    done
    # replace the amalgamated file in one step once all logs for this instance are converted
    if [ -f "${instance}/server/ourserver/log/splunkGcc.tmp" ]; then
        mv "${instance}/server/ourserver/log/splunkGcc.tmp" "${instance}/server/ourserver/log/splunkGcc.log" 2> /dev/null
    fi
done
This bash script is in cron and runs every minute on each of our servers. Splunk is pointed at the one amalgamated file. Hope this helps; I know our devs were champing at the bit to get some charts from this data.
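For reference, the cron entry is just an ordinary every-minute job, something like the following (assuming the bash script above was saved as /usr/local/bin/convertGcLogs.sh, a name I've made up here):

* * * * * /usr/local/bin/convertGcLogs.sh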
I'm running into the same problem but didn't know why it was happening until reading your post. Have you found a workaround?
No, we never did find a workaround for this.
Rather than monitoring JVM GC logs, you might want to consider using Splunk for JMX to poll for garbage collection metrics. The Splunk for JMX app comes with configuration examples for GC MBean attributes and also some example dashboards for visualizing GC metrics.