Alerting

Doubts in alerts

Daniel11
New Member

1) What do you do when there is a delay in the indexer?
2) How long is the delay period? (Is there a maximum time cap, or do you wait until the delay in the indexer is completely cleared?)
3) Do you send any notifications about the indexer delay?
If yes: i) What information can you include in that notification (like a tentative time for the next alert schedule)?
ii) If there is a continuous delay and you miss 2-5 time intervals, do you send a mail for each time period or a single mail with all the information?
4) If there is a 2 hour delay in the indexer, do you check for the missed intervals after the delay is cleared, or only check from the current time period? (For example, RunFrequency is 5 mins and there is a delay from 10 AM that is cleared at 11 AM. Do you scan from 10 AM or from 11 AM?)


PickleRick
SplunkTrust

Let me add one thing to your response. The delay between _indextime and _time does not necessarily reflect a delay in Splunk processing. While _indextime is generated by Splunk (and you do have your time synchronized across your Splunk environment, right?), the _time value might be parsed from the event itself, so it might not reflect the actual time the event was generated (for example because the clock on the source system is wrong), or the time the event was actually "picked up" by the ingestion pipeline, because - for example - you might be batch-reading a whole day of data from a file that gets synced from its source once a day.

So the difference between _indextime and _time might indicate many things, not just Splunk-induced delays.

In my case there was a batch of sources with around a 2h delay because of a wrongly set timezone, and for many of the Windows sources the delay was around 10-15 minutes because the events were retrieved using WEF, which pulls events periodically.
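A quick way to see which sources are affected is to break the delay down per host and sourcetype. A rough sketch (your_index is a placeholder for your own index):

index=your_index
| eval delay=_indextime-_time
| stats avg(delay) AS avg_delay max(delay) AS max_delay count BY host sourcetype
| sort - avg_delay

An avg_delay close to a whole number of hours usually points to a timezone problem, while a moderate, fairly constant delay often just reflects the collection interval (as with WEF above).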


gcusello
SplunkTrust

Hi @Daniel11,

In log indexing there's usually a small delay (typically no more than 30 seconds) because Forwarders send data with a configurable frequency (the default is 30 seconds, and this value usually isn't modified).

If you have a longer delay between ingestion and indexing, there's something to analyze:

  • network congestion,
  • insufficient resources on the Indexers,
  • a large volume of logs on one Forwarder.

The first and second problems must be solved outside Splunk.
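For the first two causes, one starting point (a rough sketch based on Splunk's own metrics.log in the _internal index) is to check how often queues are blocking on your Indexers and Forwarders:

index=_internal source=*metrics.log* group=queue blocked=true
| stats count BY host name
| sort - count

A persistently blocked indexqueue usually points at the Indexers themselves, while blocked tcpout queues on the Forwarders tend to point at the network or at Indexers not keeping up.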

The third requires some analysis, and it's possible to change some configuration parameters, e.g.:

  • by default the maximum bandwidth a Forwarder will use is 256 KB/s,
  • if you have many syslog sources, you could use more syslog servers,
  • etc...
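If the Forwarder throughput cap is the bottleneck, it can be raised. A minimal sketch, assuming the standard [thruput] stanza in limits.conf on the Forwarder and that your network and Indexers can absorb the extra traffic:

# limits.conf on the Forwarder
[thruput]
# default is 256 (KB/s); 0 means unlimited
maxKBps = 1024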

Anyway, you can analyze delays with a simple search like this (replace your_index with your own index):

index=your_index
| eval delay=_indextime-_time, indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S.%3N")
| table _time indextime delay

and put a threshold to use in an alert, e.g. a delay greater than 60 seconds:

| where delay>60
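For the alert itself, a possible sketch (still assuming your_index and a 60-second threshold) that returns one row per host with the worst delay seen in the search window:

index=your_index
| eval delay=_indextime-_time
| stats max(delay) AS max_delay latest(_indextime) AS last_indextime BY host
| where max_delay>60
| eval last_indextime=strftime(last_indextime,"%Y-%m-%d %H:%M:%S")

Scheduled as an alert, the trigger option "For each result" sends one notification per host, while "Once" sends a single mail with all the rows.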

So, answering your questions:

  1. analyze the delay and intervene (if possible),
  2. the default delay should be less than 30 seconds; if it's more, it must be analyzed and a solution found (see above),
  3. see the above alert,
    1. it's a search, so you can put in it all the information you need, usually: host, _time, indextime, delay,
    2. it's a search, so configure it as you need,
  4. you can only find an event after it has been indexed; if an event isn't indexed yet, it isn't searchable (see the sketch below for one way to catch late-arriving events),
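On point 4: one way (a sketch, again assuming your_index and a 5-minute run frequency) to make sure late-indexed events are not missed is to scope the alert search by index time instead of event time, using the _index_earliest and _index_latest time modifiers:

index=your_index _index_earliest=-5m@m _index_latest=@m
| eval delay=_indextime-_time
| table _time _indextime delay host sourcetype

Each run then looks at whatever was indexed in the last 5 minutes regardless of the events' _time, so events that arrive 2 hours late are still picked up by the run that follows their indexing.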

I hope I have answered all your questions.

Ciao.

Giuseppe
