Getting Data In

How to measure latency in the forwarder layers?

BDein
Explorer

Hi,

Many Splunkers know how to measure common latency/time skew in Splunk using _time and _indextime, but who knows how to measure the latency at every step from a UF on its way to the indexers, where there may be more forwarders in between (heavy, intermediate, etc.) at which latency can build up?

This question was actually asked here: Indexing latency arising at forwarders?, but never answered.

Does anyone know how to nail down this information?

My idea was to enrich the data at every level by having each forwarder tier add its hostname and a timestamp to every event. That way you would always be in control and know the exact source of any latency - if you can follow my approach?

I.e., would it be possible to use INGEST_EVAL to add new fields at every tier the event passes, like:

t<no>_host=<host>
t<no>_time=<timestamp>


This approach would likely also touch on cooked data, and on the extent to which it is possible to enrich such data along the way.
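A rough sketch of what this could look like on the first heavy forwarder - the stanza and field names are made up for illustration, and I haven't verified this end to end. In transforms.conf:

[add_tier1_meta]
INGEST_EVAL = t1_host=host, t1_time=time()

and in props.conf:

[my_sourcetype]
TRANSFORMS-tier1meta = add_tier1_meta

Note that INGEST_EVAL only runs on parsing-capable (full Splunk Enterprise) instances, and cooked data normally isn't re-parsed at later tiers - which is exactly the open question here.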

Let me hear your thoughts and ideas.


PickleRick
SplunkTrust

Well... I'd say this measurement wouldn't be that useful. Even if you added timestamps at subsequent forwarders, you couldn't tell whether you're getting stuck at parsing, at output from one forwarder, or at input into the next. I think it's more useful to monitor forwarder throughput and watch whether it falls below typical values.

One other thing - the typical _indextime-_time calculations only make sense if all your components have a properly configured time source and produce properly timestamped data. In general I use it not as a measurement of latency across the whole event-processing path, but rather as a measurement of the source system's clock quality.
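For reference, that classic end-to-end measurement is just something like this (the index and time range are placeholders):

index=* earliest=-15m
| eval latency=_indextime-_time
| stats avg(latency) max(latency) by host sourcetype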

It's quite typical for Windows events to be several seconds or even minutes "behind real time" at indexing, simply because the event-acquisition process for Windows (especially with WEF) works in batches.

Having said that, you probably could use an ingest-time eval to add a time()-based value to an event (I didn't try it myself though!), but remember that you'd need indexed field(s) to store these values. It gets a little complicated here: you'd either have to create multiple fields to hold values for several tiers, or manipulate multivalue fields (which I'm not sure is possible at ingest time, and even if it were, it would be horribly inconvenient for further processing with stats).
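If you went that route, you'd also need something along these lines in fields.conf on the search tier so the per-tier values are treated as indexed fields (field names hypothetical, one pair of stanzas per tier):

[t1_time]
INDEXED = true

[t1_host]
INDEXED = true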


bjarnedein
Explorer

Hi @PickleRick ,

I completely follow you - time sync (NTP) is crucial on all layers of data sources feeding Splunk if correlations are to work and make sense.

I've worked a lot with time at quite a few customer sites, and way too many don't have control over their time and time zones.

As to the question of whether this would make sense: to me there is no doubt that it would, for more than one reason:

  1. Troubleshooting (mainly)
    1. time skew
    2. latency
    3. queuing

Just to name a few.

As of today it is not possible to measure the latency between tiers, and you have to use other methods to find bottlenecks etc.

Yes - additional fields will be needed, and multivalue fields are a no-go and a mess as I see it.

The challenge, as I see it, is: how do you add additional fields to cooked data (which the events will be after the first HFwd)? Do you know how to do that?
Is it possible to create an app on, e.g., the indexer (the last tier) that adds the arrival time at the indexer?

PS. I know this is the _indextime if, and only if, there is no HFwd between the UF and the indexer; otherwise _indextime is set at the first HFwd, which is useless for this purpose.
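Assuming such per-tier fields existed as indexed fields (t1_time is hypothetical here), splitting the latency per hop could be as simple as:

index=* earliest=-15m
| eval uf_to_hf=t1_time-_time, hf_to_idx=_indextime-t1_time
| stats avg(uf_to_hf) avg(hf_to_idx) by host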

@isoutamo - Your suggestion about filing an idea at ideas.splunk.com is fine, but my experience with it is not so positive; it's too slow in my opinion. The smartest and fastest ideas come up here 👍

BDein
Explorer

Hi R.Ismo,

Thanks for your input - this is pretty much my understanding as well, but I'd like to hang around a bit, as there are some very creative folks out there who might have interesting ideas/workarounds etc. 😊


isoutamo
SplunkTrust
With rsyslog as a transport method we have added this kind of metadata at every hop to the beginning of the event, but with Splunk's mechanisms I don't know of any reasonable way to do it. The only thing that comes to my mind now is that you would somehow have to store the events to disk at every hop and then re-read them, but from a performance point of view at least, that is not feasible.
Anyhow, I propose that you file that Idea and post it here, so we can give it some points.
b.r. Ismo

isoutamo
SplunkTrust

Hi

If I have understood correctly how Splunk manages the stream across those steps, this is not possible. You can add such a timestamp only on the first full Splunk Enterprise instance (e.g. HF or indexer) in the event path, but not on all of them.

I'm not sure if anyone has already created an idea for this on https://ideas.splunk.com. You could try to find it there, and if it's not there, just create a new idea for it.

r. Ismo
