We have a Tomcat application generating logs on the server, which are then indexed and made available in Splunk. For the past 3 weeks we have noticed that, abruptly, some logs are not visible in Splunk.
We also exported the actual Tomcat logs from the server and see that the original log file contains all logs as expected, but Splunk is missing several statements.
What should our troubleshooting approach be for such an issue? I have been unable to find any similar issues in open source community articles.
There are several approaches for troubleshooting:
Did you already take a look at the delta between the time your events occurred and when they were indexed into Splunk?
Try something like:
<your tomcat log search> | eval delta=_indextime-_time | timechart max(delta)
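If the lag turns out to be limited to particular hosts or sources, splitting the same search can help you narrow it down - for instance (the 5-minute span is just an example value):
<your tomcat log search> | eval delta=_indextime-_time | timechart span=5m max(delta) by host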
Did you also take a look at your _internal index for parser errors?
index=_internal (Err* OR warn*)
Look at the specific time ranges where you think the abnormal behaviour begins.
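If a plain error/warning search comes up empty, it can also be worth looking specifically at the components involved in line breaking and timestamp recognition, something along these lines (restricted to the time range where the gap appears):
index=_internal source=*splunkd.log* (component=LineBreakingProcessor OR component=AggregatorMiningProcessor OR component=DateParserVerbose)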
I hope this gives you a first shot at solving your problem.
I will ask my organisation's infra team to try these commands out; I don't have direct access to the server at this point.
In general, the observation is that within a given timeframe (as small as 30 seconds), some logs are visible while the next few lines are not visible in Splunk (I am assuming they were not indexed correctly).
We did not find any parser errors for the given timeframe.
However, the team did notice that some of the logs were forwarded with a delay of about 3 hours, while others were forwarded immediately. Therefore, even in Splunk we do see some of the missing logs, just outside the actual Tomcat server test run timeframe.
For example, in Tomcat all the logs are seen between 4:30 and 4:35 pm, while in Splunk the log statements are divided like below:
4:32 to 4:35 pm - statements 1-4, 10-15
6:30 - statements 5-8, and so on.
A common cause of problems like this is bad event breaking.
From experience, Tomcat exceptions can cause trouble here, because long stack traces can confuse the parsing/breaking process.
Two things to look at are truncation, and then subsequent breaking.
If your application triggers a huge exception, two things can happen:
1.) The exception is broken into multiple events instead of just one - normally something which looks like a timestamp inside the stack trace can cause this.
2.) The exception is parsed as one event, but truncated after 10,000 chars (the default TRUNCATE limit in props.conf).
Following either of these, Splunk can struggle to spot the correct point at which to break the next event.
Tools you can use to tackle this include setting a specific timestamp extraction using TIME_PREFIX and TIME_FORMAT, to ensure you only ever extract the timestamp from the beginning of the message, together with MAX_TIMESTAMP_LOOKAHEAD to limit how far into the event Splunk will look for a timestamp.
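As a rough illustration only, a minimal props.conf sketch along those lines might look like the following, assuming the log lines begin with a timestamp of the form yyyy-MM-dd HH:mm:ss.SSS and using a made-up sourcetype name (adjust both to your actual Tomcat layout, and apply the settings on the parsing tier, i.e. indexer or heavy forwarder):
# props.conf - sourcetype name and timestamp format are assumptions
[tomcat:catalina]
SHOULD_LINEMERGE = false
# break events only where a new line begins with a date
LINE_BREAKER = ([\r\n]+)\d{4}-\d{2}-\d{2}
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S.%3N
MAX_TIMESTAMP_LOOKAHEAD = 25
# raise the 10,000-char default if long stack traces are being cut off
TRUNCATE = 100000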
Thanks for your inputs. We reviewed the application logs, and in the small timeframe of the test run where logs are missing, we don't see any exceptions. The 200-success logs themselves are missing (or rather delayed, as per current findings).
This looks like an issue with permissions on some of the log paths.
Please check whether Splunk has valid permissions to read the logs from the source.
It's the same log file from which some logs are being read and some are lost, so it doesn't look like a permission issue at this point.
OK, so now that sounds like:
a.) A hugely delayed log ingestion problem (less likely, but possible)
b.) Bad timestamp processing (far more likely)
To see whether it's A or B:
Run a search for all of your events over whatever time period you need to 'see' them.
Add the following to your search:
"your base search"
| eval indextime=strftime(indextime,"%Y-%m-%d %H:%M:%S")
| eval delta=indextime-_time
| table _time indextime delta
Examine the difference between _time and indextime in the table for the 'delayed' events (the delta column shows that difference in seconds).
If indextime is significantly later than your expected value of _time, this would suggest A.
If _time is significantly later than indextime, then this would indicate B.
Perhaps you could share some of the results, including the timings as noted above?
The difference is big, so it is a "delayed log ingestion problem".
_time                      indextime             delta
2019-02-28 09:38:25.402    2019-02-28 09:38:26   0.598
2019-02-28 09:38:10.868    2019-02-28 09:38:11   0.132
2019-02-28 09:37:55.963    2019-02-28 09:37:56   0.037
2019-02-28 09:37:41.743    2019-02-28 09:37:43   1.257
2019-02-28 04:39:40.263    2019-02-28 09:42:14   **18153.737**
2019-02-28 04:39:11.629    2019-02-28 09:41:47   **18155.371**
2019-02-28 04:38:58.574    2019-02-28 09:41:36   **18157.426**
2019-02-28 04:37:08.356    2019-02-28 09:39:44   **18155.644**
It has occurred twice in the past 2 weeks, so should we be looking at some underlying issue in the Splunk forwarder? Or are there any guidelines to prevent this issue from occurring in production?