Getting Data In

What's the best practice to ensure all your forwarders have forwarded all the events up to a certain timestamp?

Path Finder

Here's one possible solution that I think would work, provided there are constant events coming in from each source.

search source="a" | head 1 | append [search source="b" | head 1] | stats min(_time) as LatestReliableTime
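A sketch of how the same idea might generalize beyond two named sources: take the latest event time per source, then the minimum of those. The 2-day window and the field name latest are assumptions chosen for illustration, and the same caveat applies: a source that goes quiet holds the result back.

```spl
earliest=-2d
| stats max(_time) as latest by source
| stats min(latest) as LatestReliableTime
| convert ctime(LatestReliableTime)
```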

How else would I know I have a complete picture of all the events from all sources up to some timestamp?

Super Champion

I'm borrowing from Mick's answer here. I just want to point out that you can use this metadata approach to really capture two different scenarios:

  1. Your forwarder stopped sending you data. (The original question)
  2. Your forwarder is sending you data from the future. (time travel?)

I would like to make the argument that both are of equal importance to monitor, for the following reasons:

  1. We want to know when a forwarder is down (obviously)
  2. It's important to detect when the system clock on a forwarder is out of sync with the indexer.
  3. Timestamp recognition problems need to be fixed ASAP.
  4. Not checking for future events inhibits your ability to detect when a forwarder goes down. (For example, if an event is received with a timestamp 10 days in the future, lastTime will not reflect the correct current time until that point. If the forwarder goes down within that time frame, it will not be detected by an alert that only looks for old events.)
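To make the last point concrete, here is a small self-contained sketch that fakes a host whose lastTime is 10 days ahead. It assumes a Splunk version that has the makeresults command; the field names stale_alert and future_alert are made up for illustration, and the thresholds are only examples.

```spl
| makeresults
| eval lastTime=now()+10*86400
| eval age=now()-lastTime
| eval stale_alert=if(age>60, "fires", "silent")
| eval future_alert=if(age<-15, "fires", "silent")
| fields age, stale_alert, future_alert
```

Here age works out to -864000, so a check that only looks for age>60 stays silent until the wall clock catches up with that timestamp, while the negative-age check fires right away.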

Here is a search that can detect both situations:

| metadata index=_internal type=hosts | eval age=time()-lastTime | search age>60 OR age<-15 | sort age d | convert ctime(lastTime) | fields age,host,lastTime

There are a few things to note here:

  1. If you are forwarding internal events, then index=_internal lets you check for a down forwarder on a much shorter interval, since metrics events are generated every 30 seconds. If you are not, then pick your most active index, or use something like | metadata type=hosts | append [ | metadata index=os type=hosts ] | stats max(lastTime) as lastTime by host to query more than one index. (Try to pick an index that is less prone to timestamp configuration glitches, which could prevent this alert from working properly.)
  2. Notice that I'm using time() here and not now(). This is because we want the current system clock instead of the time the search was scheduled to run. This gives a more accurate value for age, which can otherwise be skewed by delays in scheduled search execution. (Note that if you are running a version prior to 4.1, you'll need to use now() instead of time(), and you will need to change the -15 threshold to something like -60 or -90, depending on your potential scheduler delays.)
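Putting the two notes together, a multi-index version of the alert might look roughly like this. It is only a sketch: the os index is just the example index mentioned above, the status field is a label added for illustration, and the 60 and -15 second thresholds should be tuned to your environment.

```spl
| metadata type=hosts index=_internal
| append [| metadata type=hosts index=os]
| stats max(lastTime) as lastTime by host
| eval age=time()-lastTime
| where age>60 OR age<-15
| eval status=if(age>60, "stale", "future")
| sort - age
| convert ctime(lastTime)
| fields host, status, age, lastTime
```

Taking max(lastTime) per host means a host only shows up as stale when it has been quiet in every index queried.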

Splunk Employee

Why not just use a metadata search, similar to the ones provided here? http://www.splunk.com/wiki/Deploy:HowToFindLostForwarders

| metadata type=hosts | eval age = now() - lastTime | search age > 86400 | sort age d | convert ctime(lastTime) | fields age,host,lastTime

The 'lastTime' field will tell you when Splunk last received an event from that host, and because you're searching the metadata, it should be a very quick answer.

Path Finder

lastTime still corresponds to the last event that occurred for the host, not the last time it got an update from the forwarder, right? According to the link you sent, there used to be a "heartbeat" message sent from the forwarder. Why did that go away?

Splunk Employee

Ah, right, sorry! Yes, min is correct. So then, we have something like earliest=-2d | STATS min(_time) BY source | RENAME "min(_time)" AS tmin | STATS min(tmin) AS tmin | CONVERT ctime(tmin), which produces a single answer: the global minimum. Note the earliest=-2d, which limits the search to the last 2 days.

Path Finder

You mean min(_time) by source? I need to take the min to ensure that all the other sources have forwarded their events. This assumes that every source continually has new events. What if there are no events from source B for days? Then that "reliable" time is going to lag behind.

Splunk Employee

Why not

stats max(_time) by source

?
