Getting Data In

Why are our Splunk indexes not showing all log entries from server?

gsonal03
New Member

We have a tomcat application generating logs on the server, which are then indexed and available in Splunk. For the past 3 weeks, we have noticed that abruptly, some logs are not visible under Splunk.

We also exported the actual Tomcat logs from the server, and see that the original log file contains all logs as expected, but Splunk is missing several statements.

What should be our troubleshooting approach for such an issue? I am unable to find any existing issues like this in open source community articles.

Tags (1)
0 Karma

rvany
Communicator

Just to clarify some things:

  • you have one logfile or more than one?
  • if more: these come from one server or more than one?
  • if more: are these in the same timezone - or - is the localtime for these servers set to the same?
0 Karma

nickhills
Ultra Champion

Ok so now that sounds like:

a.) A hugely delayed log ingestion problem - (less likely, but possible)
b.) Bad timestamp processing - (far more likely)

To see if its A or B.

Run a search for all of your events over whatever time period you need to 'see' them.
Add the following to your search:
"your base search"
| eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S")
| eval delta=_indextime-_time
| table _time indextime delta

Examine the difference between the _time and indextime in the table for the 'delayed' events (and the delta between those times in seconds).

If indextime is significantly later than your expected value for _time, this would suggest A
If _time is significantly later than indextime, then this would indicate B

Perhaps you could share some of the results, including the timings as noted above?

If my comment helps, please give it a thumbs up!
0 Karma

gsonal03
New Member

The difference is big, so it is "delayed log ingestion problem".

2019-02-28 09:38:25.402 2019-02-28 09:38:26 0.598
2019-02-28 09:38:10.868 2019-02-28 09:38:11 0.132
2019-02-28 09:37:55.963 2019-02-28 09:37:56 0.037
2019-02-28 09:37:41.743 2019-02-28 09:37:43 1.257
2019-02-28 04:39:40.263 2019-02-28 09:42:14 **18153.737**
2019-02-28 04:39:11.629 2019-02-28 09:41:47 **18155.371**
2019-02-28 04:38:58.574 2019-02-28 09:41:36 **18157.426**
2019-02-28 04:37:08.356 2019-02-28 09:39:44 **18155.644**

It occurred twice in past 2 weeks, so should we be looking at some underlying issue in splunk forwarder? Or are there any guidelines to prevent this issue to occur in production?

0 Karma

nickhills
Ultra Champion

All then index times are "close" at least to each other.
This means that all the events arrived at once (ish) so it could be an ingestion problem.
-Or-

What may be happening is that the event time is not being parsed properly, so although the events all arrived "on time" some of them were timestamped as being in the past/future- this means that you run a search, you won't see them until later (even though they are already there).

I would start looking at your index=_internal logs and see if you can see messages about blocked queues or any warnings about timestamp errors

If my comment helps, please give it a thumbs up!
0 Karma

gsonal03
New Member

Thanks, I will review this with internal team and get back.

Should we also change the throttle limits of the forwarder - I see such suggestions in another community article - https://answers.splunk.com/answers/579305/delayed-log-ingestion.html? The application does send multiple logs within span of few seconds. Not sure if this also causes the queue to get filled, if such is the case.

0 Karma

nickhills
Ultra Champion

The default maxkbps for a forwarder is 256kbps which is plenty in most cases.

I have only ever had to change this for really busy domain controllers sending millions of events a day, so whilst its possible your host is very noisy, i'd start by looking at your indexers to see if it really is just this one host causing issues, or you have a wider problem you have not noticed until now.

If my comment helps, please give it a thumbs up!
0 Karma

gsonal03
New Member

Unfortunately, we did not find reference to blocked queues in splunkd log. The only reference to queues occurred during shutdown messages. In the given timeframe where the logs went missing/delayed, we see info logs stating "TcpOutputProc - Connected to idx", nothing else.

We also wanted to review metrics.log based on the article https://docs.splunk.com/Documentation/Splunk/7.2.4/Troubleshooting/Aboutmetricslog, but the logs from last week are no longer available.
Now, I will need to wait until the problem reoccurs.

0 Karma

tsaikumar009
Explorer

this looks like issue with permissions on the some of the logpaths.

Please check if splunk has valid permissions to fetch the logs from the source.

0 Karma

gsonal03
New Member

It's the same log file from where some logs are being read and some are lost. So, it doesn't look like permission issue at this point.

0 Karma

nickhills
Ultra Champion

Common causes of problems like this could be as a result of bad event breaking.
From experience - tomcat exceptions can cause trouble here, because long exceptions can confuse the parsing/breaking process.

Two things to look at are truncation, and then subsequent breaking.
If your application triggers a huge exception, two things can happen.
1.) the exception is broken into multiple events instead of just one - normally around something which looks like a timestamp can cause this.
2.) the exception is parsed as one event, but truncated after 10,000 chars (by default) look at TRUNCATE

Following either of these these Splunk can struggle to spot the next correct point to break the next event.
Tools you can use to tackle this include setting a specific Timestamp extraction using TIME_PREFIX and TIME_FORMAT to ensure you only event extract the timestamp from the beginning of the message, then you can use BREAK_ONLY_BEFORE_DATE,BREAK_ONLY_BEFORE or MUST_NOT_BREAK_BEFORE
https://docs.splunk.com/Documentation/Splunk/7.2.4/Admin/Propsconf#Line_breaking

If my comment helps, please give it a thumbs up!
0 Karma

gsonal03
New Member

Thanks for your inputs. We reviewed the application logs and in the small timeframe of test run, where logs are missing, we dont see any exceptions. The 200 success logs itself are missing (rather delayed, as per current findings).

0 Karma

damann
Communicator

There are severeal approaches for troubleshooting:
Did you already take a look at your delta between the time your events occured and when the have been indexed into splunk?
try something like:
search for your tomcat logs | eval delta=_indextime-_time | timechart max(delta)

Did you take a look at your _internal index for Parser Errors?
index=_internal Err* OR warn*
and look at the specific areas of time where you think the unnormal behaviour begins.

I hope this gives you a first shot for solving your problem.

0 Karma

gsonal03
New Member

I will inform my organisation infra team to try these commands out. I don't have direct access to the server at this point.
In general, the observation is for a given timeframe also (as small as 30 seconds), some logs are visible, while next few lines are not visible in splunk (I am assuming they were not indexed correctly).

0 Karma

gsonal03
New Member

We did not find any parser error for the given timeframe.
Although, the team did notice that some of the logs were "forwarded" delayed by 3 hours, while others were forwarded at the same time. Therefore, even in splunk we see some of the missing logs available outside the actual tomcat server test run timeframe.
For e.g. In tomcat all the logs are seen between 4:30 to 4:35 pm, while in splunk the log statements are divided like below:
4:32 to 4:35 p.m. - statements 1-4, 10-15
6:30 - statements 5-8, and so on..

0 Karma
Get Updates on the Splunk Community!

.conf24 | Day 0

Hello Splunk Community! My name is Chris, and I'm based in Canberra, Australia's capital, and I travelled for ...

Enhance Security Visibility with Splunk Enterprise Security 7.1 through Threat ...

 (view in My Videos)Struggling with alert fatigue, lack of context, and prioritization around security ...

Troubleshooting the OpenTelemetry Collector

  In this tech talk, you’ll learn how to troubleshoot the OpenTelemetry collector - from checking the ...