Splunk Search

Could someone provide guidance on lag found in Splunk universal forwarders?

_pravin
Communicator

Hi Community,

 

I have two separate Splunk installs: one is version 8.1.0 and the other is 8.2.5.

The older version is our production Splunk install. On a dashboard that calculates the difference between the index time and the actual time, I can see a lag.

Since it's a production environment, I assumed the lag might be due to one of the following reasons.

  1. The universal forwarder is busy as it's doing a recursive search through all the files within the folders. This is done for almost 44 such folders. Example: [monitor:///net/mx41779vm/data/apps/Kernel_2.../*.log]
  2. The forwarder might be too outdated to handle such loads. The version used is 6.3.3.
  3. The Splunk install is busy because a lot of data is already coming in from other forwarders.

To narrow down the issue, I set up the same configuration in another environment. This test environment does not carry the heavy production load but has the same settings with reduced memory. When I set up a completely new forwarder and replicated the setup in the test environment, I still saw the same lag.

I'm confused about why this is happening.

Could someone provide me with tips or guidance on how to work through this issue?

Thanks in advance.

 

Regards,

Pravin

 


PickleRick
SplunkTrust

There are many possible reasons for event lag. Off the top of my head:

1) Intermittent network connectivity problems.

2) Not enough bandwidth (either throughput is restricted in limits.conf by the thruput settings, or you simply have a low-capacity network link).

3) There is no lag as such, but your source's clock is skewed.

The lag caused by having many directories to monitor is typically present only shortly after the forwarder starts, because it has to check all the dirs and files to verify whether their state is consistent with the fishbucket database. After that it only checks new files. But you might need to raise your open files limit to help your forwarder keep track of all those files.

Upgrading the forwarder is of course highly advised since 6.x hasn't been supported for some time already. It should work, but Splunk doesn't support it anymore.
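
To check reason 2 (a thruput limit) from the Splunk side, a search along these lines might help. It is only a sketch: it assumes the forwarder's internal logs reach the indexers, your_forwarder_host is a placeholder, and the field names are the ones metrics.log normally writes for the thruput group as far as I recall.

```spl
index=_internal host=your_forwarder_host source=*metrics.log* group=thruput name=thruput
| timechart span=5m max(instantaneous_kbps) AS max_kbps avg(instantaneous_kbps) AS avg_kbps
```

If max_kbps sits flat at the configured maxKBps value for long periods (on a universal forwarder the default is, to my knowledge, 256), the forwarder is probably being throttled.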

_pravin
Communicator

Hi @PickleRick ,

 

I have checked all of the above conditions. We have a proper network setup with no connectivity issues, and limits.conf has default settings. I suspected the issue might be a skewed clock, since we work in a different time zone than my original location. But all the servers share a common time zone, so that is not the case here either.

Since the forwarder reads data from different log files across folders, the lag that we find is the maximum lag for a particular sourcetype. For example:

_pravin_1-1655677693037.png

In the above image, the event was generated at 9:00 in the morning but it was indexed only at 13:00 which is almost 4 hours after the event was generated.

When I run the same SPL for the same source a few minutes later, I notice a delay of only a few minutes.

_pravin_2-1655677920912.png

I don't understand if this is because of the load on the forwarder or some other issue.

For the same source, two different events have different indexing delays. This is what really confuses me. Could you please shed some light on this?

 

Regards,

Pravin

 


PickleRick
SplunkTrust

OK. I don't know about the lag but there's something fishy about your timestamp parsing.

Your raw events have two timestamps each. In the first screenshot there is almost four hours' difference between them: 9:14 vs 13:05. I don't know what those timestamps are, but they might be some kind of "start timestamp" and "end timestamp". For some reason the index time is relatively close to the second one (which makes sense: something ends, the event is emitted, it's received by the forwarder, passed on to the indexer, and indexed, so there can be some slight delay), but _time is being parsed from the first timestamp.

So it doesn't seem to be an issue of "lag" but more of a time parsing problem.
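
One way to confirm which of the two raw timestamps ends up in _time is to print the parsed time and the index time next to the raw event. This is a rough sketch; your_index and your_sourcetype are placeholders, not names from this thread.

```spl
index=your_index sourcetype=your_sourcetype
| eval lag_sec=_indextime-_time
| eval event_time=strftime(_time, "%Y-%m-%d %H:%M:%S"), index_time=strftime(_indextime, "%Y-%m-%d %H:%M:%S")
| table event_time index_time lag_sec _raw
| sort - lag_sec
```

If event_time always matches the first timestamp in _raw while index_time follows the second, the "lag" is really the gap between those two timestamps.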


isoutamo
SplunkTrust

Hi

I totally agree with @PickleRick that this is more likely an issue with your events' timestamps than lag in connections. Of course, if a lot of events are coming from one UF, the throughput limit can be hit, but based on your event timestamps (only a couple within the same second), I am not expecting that.

Another issue with those timestamps is that they contain no TZ information! If you operate across different time zones, that could also be one reason for differences of 1 or 0.5 times x hours between _time and _indextime.

If you have set up the MC (Monitoring Console) on your node, you can use it to look into these issues. It also tells you whether there are other issues in your input phase (Settings -> MC -> indexing -> inputs).

r. Ismo


_pravin
Communicator

Thank you @PickleRick.

I will check the different reasons listed by you and update this thread.


_pravin
Communicator

Hi @ITWhisperer ,

 

One is the index time and another one is the event time.

| eval lag_sec=_indextime-_time 

The difference is calculated as mentioned above.
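
For context, a minimal sketch of how that eval can be extended to summarize the lag per host and sourcetype. The index name and time range are placeholders, not the values used on the dashboard.

```spl
index=your_index earliest=-24h
| eval lag_sec=_indextime-_time
| stats count min(lag_sec) AS min_lag_sec avg(lag_sec) AS avg_lag_sec max(lag_sec) AS max_lag_sec BY host sourcetype
| sort - max_lag_sec
```

Comparing min, avg, and max side by side shows whether every event is delayed or only a few outliers drive the maximum.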

 

Regards,

Pravin


ITWhisperer
SplunkTrust

Where do the two times (index and actual) come from?


_pravin
Communicator

Both times are available from the data.

 

 

 
