We have some logs being written by a buffered writer. This means that occasionally (depending on when the buffer is flushed) an incomplete event is written to the log and indexed by splunk.
2010-09-16 00:00:02,874 INFO Some full log message here I am a multi line message that needs to be accounted for 2010-09-16 00:00:03,874 INFO Another full log message here 2010-09-16 00:00:04,874 INFO Partial log mess
Splunk happily indexes all three lines as an individual events. In actuallity I want splunk to index the two complete events and wait for the event to be completely written.
2010-09-16 00:00:02,874 INFO Some full log message here I am a multi line message that needs to be accounted for 2010-09-16 00:00:03,874 INFO Another full log message here 2010-09-16 00:00:04,874 INFO Partial log message completed now and should be indexed 2010-09-16 00:00:05,874 INFO Onto the next message that may be incomplete
I have been playing around with LINE_BREAKER but cannot get it to work as I desire. The partial line message is always indexed as an event. I suspect this is because the partial message will always be at the end of the log file and splunk breaks the event there. What are my options to make this work? Due to the amount of log traffic we generate flushing after each log message creates a noticable degredation in performance.
The closest I got was.
LINE_BREAKER = [\r\n]+((?=\d+-\d+-\d+\s+\d+:\d+:\d+,\d+\s+[A-Z]+))
This seems to break my multiline events correctly but the incomplete event that is the final line of the file is still indexed. It seems as if splunk hits the EOF and terminates the event its currently parsing.
Are you using a Light Forwarder to read this file? This won't happen if you switch to a regular forwarder (and you shouldn't need to mess with LINE_BREAKER). Alternatively, you can try to make sure you buffered writer flushes only at the end of an event. It does not have to flush after every event, but just not in the middle of an event. So, e.g., flush after every 200 events instead of every 100kb. This might complicate the code a bit, or else introduce a little wasted memory.
We are using the Light Forwarder. The standard forwarder can handle partial events? If I recall we were using the regular folder before but encountered a similar problem (I will need to test again to confirm). Flushing after N events may be an option. I will talk to our dev team and see if they are willing to write a custom log4j appender. Thanks for the tips.
What versions are at play?
In 4.0.x, a transient EOF (that splunk manages to reach) mid-event would generate a done key, which would shut down all the machinery dealing with the datastream, ensureing that the event cuts at that point. This wasn't ideal, but since this pattern of writes is uncommin, and since splunk's tailing machinery often wasn't on top of all files at all time, it wasn't that large a problem.
In 4.1.x, our tailing machinery is much much much more low latency, and so the strategy changed to wait for EOF to stay still for a few seconds before shutting down the machinery, so generally speaking should have this problem less often.
However, if you are getting this problem it's likely that part of your text is sitting in a application or libc buffer for a significant amount of time, possibly during a brief lull. The buffers can be of significant size (several kilobytes).
I think you just have to adopt the same behavior that most logging applications do: that of flushing at the end of your events. It's a good practice in general. If your app says "omg, huge bug happening, dying." and then crashes when that data is still in the buffers, you'll never see the message. Flushing should not have a significant cost, it's just the matter of handing that memory buffer from your app to the OS, not the matter of waiting for the write to complete.
We are running build 4.1.3. It's sounding like our application developers are going to work towards an appender that buffers and flushes at the event level rather than byte level as suggested by gkanapathy.
I was having a similar issue (see Log File Not Breaking Correctly)
This solved my problem:
time_before_close = <integer> * Modtime delta required before Splunk can close a file on EOF. * Tells the system not to close files that have been updated in past <integer> seconds. * Defaults to 3.