i.e. process each event at least once, and ideally once and only once.
An example search might be:
index=_internal sourcetype=splunkd earliest=0 latest=now _index_earliest=-2m@m _index_latest=-1m@m | rename _indextime AS _time | bucket span=1m _time | stats count by _time
If we have this as a saved search run on a 1-minute schedule writing to a summary index, we should get a count of events that were indexed between 1 and 2 minutes ago.
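For reference, the scheduled saved search described above might look something like this in savedsearches.conf (a sketch only; the stanza name and the summary index name "summary" are assumptions):

[summary_indexed_count]
enableSched = 1
cron_schedule = * * * * *
dispatch.earliest_time = 0
dispatch.latest_time = now
action.summary_index = 1
action.summary_index._name = summary
search = index=_internal sourcetype=splunkd _index_earliest=-2m@m _index_latest=-1m@m | rename _indextime AS _time | bucket span=1m _time | stats count by _time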
This number should never change, right? i.e. there should never be new events with an index time in the past.
If I ran the same search with fixed epoch times, the numbers should remain the same run after run?
earliest=0 latest=now _index_earliest=1471218487 _index_latest=1471218547
What I am finding is that this isn't the case.
index=_internal sourcetype=splunkd | rename _indextime AS _time | bucket span=1m _time | stats count AS raw_count by _time | append [search index=summary | timechart span=1m values(count) as summary_count] | stats values(*count) AS *count by _time | eval diff=raw_count-summary_count
I understand events can be late, still be in flight, and be inserted back into the timeline, but that is based on _time search limitations, not _indextime. Raw event latency falls inside the normal earliest and latest bounds, so those events shouldn't be culled.
The _indextime value should be an ever-increasing number, so I don't see how events with old index times can be added such that these searches produce different numbers of raw events.
I suspect that your real problem is poor data curation regarding timestamping. Install this and you will see:
But just seeing the problems is not even half the battle; you have to figure out how to fix it, too. We do a great deal of PS Health Check work helping clients untangle this messy problem.
You are mistaken about how these settings work. Let's look at an even broader version of your search:
index=_internal sourcetype=splunkd earliest=0 latest=+1000d _index_earliest=-2m@m _index_latest=-1m@m | rename _indextime AS _time | bucket span=1m _time | stats count by _time
This is, in essence, an all-time search looking for sourcetype=splunkd events that were indexed between 1 and 2 minutes ago. Assuming that you are in production, you will have events being indexed continuously, so the events that are being indexed now will show up in the search that runs 1 minute from now. In other words, this search should ALWAYS return a non-zero number; I am very confused why you would think that it should NEVER do so. I think the confusion is over what _indextime is. It is the OS's current timestamp on the indexer that indexed the event. So in an ideal world, the value of _time would always be the same as _indextime, but the reality is that there is always delivery delay getting the events to the indexers, so _time will always be older/smaller than _indextime. Even so, both values will slowly drift into the past as time moves on.
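One quick way to see that delivery delay per event (a sketch over the _internal index; field names after the eval are my own):

index=_internal sourcetype=splunkd earliest=-15m | eval lag=_indextime-_time | stats min(lag) AS min_lag avg(lag) AS avg_lag max(lag) AS max_lag

A consistently large or spiky max_lag here is a sign that events are arriving at the indexers well after their parsed _time.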
Thanks for the reply but I think you misunderstood the question.
Given a fixed window (for testing purposes), I was able to see different numbers of events depending on when the search was run. Fundamentally incorrect results from a user's point of view.
The question really boiled down to “Why should the number of events returned change for the same period of time?”. Index time is (as you said) ever increasing, so you should never see new events with old index times being inserted way into the past (which is what we were seeing).
My original use case was forwarding of critical security events. These needed to be sent once, only once, but at least once.
The nature of Splunk is that it is difficult to craft a search to forward these in a timely manner without the use of a real-time search (disabled on our environment).
Seeing as a fair chunk of time has passed since the original question, I changed my methodology and took the events in flight before they hit the indexing tier. The search-time approach was too inconsistent for what we needed to do.
For other users that want search-time processing, it can still be an issue for out-of-order / late events. So it still made for an interesting question on designing a search to handle this.
Now, that being said, I believe that there is/was latency across a number of different hardware and software components in different indexers that was affecting my “time to search”, and as a result the consistency of results returned if it wasn’t taken into account.
I did also find a bug with Multisite index clustering in which I could prove that hot events were not streaming to the other site until the bucket rolled to warm. This resulted in having to switch to using site0 on search head cluster members.
I had previously logged tickets about trying to know what my “time to search” is, as it differs from the basic _time - _indextime calculation.
This stat doesn’t exist in any meaningful way as far as I know. It would be nice to know, for example, that within xx seconds of an event being indexed it is being returned in searches.
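As a rough approximation (a sketch, not an official stat), you can bucket recent events by index time and watch the per-event lag; buckets whose counts keep growing on repeated runs haven't fully "settled" into search yet:

index=_internal sourcetype=splunkd earliest=0 latest=+1000d _index_earliest=-10m@m _index_latest=now | eval lag=_indextime-_time | bucket span=1m _indextime | stats count avg(lag) AS avg_lag max(lag) AS max_lag by _indextime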
When you run a search with the time range picker (and it's set to something other than all time), you're searching between whatever earliest marker and "now". _indextime represents the epoch at which the data was indexed, while the _time field (which the time range picker governs) is the parsed time of the events. Your specific example using fixed epoch times for _indextime still has a moving event-time window.
If you want to find things that were indexed in a given window, regardless of their parsed time stamp, use latest=+Infinity.
To get at the root of the problem, however, might I suggest an accelerated data model that you use to drive a summary index after an interval. Data arriving "out of band" will be accelerated by the model whenever it arrives, and you can then summarize those events later, after even the laggards have had a chance to be ingested (and accelerated by the model).
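As a sketch of that approach (the data model name Security_Events is hypothetical, and the lag interval is arbitrary), a scheduled search could summarize the accelerated model once the laggards have arrived:

| tstats summariesonly=true count FROM datamodel=Security_Events WHERE earliest=-10m@m latest=-9m@m BY _time span=1m | collect index=summary

The key design choice is that the summary runs well behind real time, so late-arriving events have already been accelerated by the model before they are counted.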
Splunk doesn't understand the use of "+infinity" for latest.
"Invalid value "+Infinity" for time term 'latest'"
ie. index=_internal earliest=0 latest=+Infinity _index_earliest=-25m@m _index_latest=-20m@m
I understand what you are saying and I think you might be on the right track that events are possibly being parsed with future time stamps.
I can however use something like this latest=+1000d
I'll see if that works.
Yeah, I apologize; it seems that +Infinity is "magic" in savedsearches.conf (there's an example in Fire Brigade), but I agree that it doesn't work in the search bar. Good catch on using a lot of "future" days. I've even had to use latest=+2y (years!) for some badly formatted data.
There is a delay in getting events indexed in Splunk, but there can be additional delay in making the data searchable (data that gets indexed is not immediately searchable). To account for that delay, could you run your test with
_index_earliest=-10m@m _index_latest=-9m@m (allowing more time for Splunk to index the events and make them searchable) and see if the results are correct/improved?
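That is, the same test search from the original question, just with the index-time window pushed further back:

index=_internal sourcetype=splunkd earliest=0 latest=now _index_earliest=-10m@m _index_latest=-9m@m | rename _indextime AS _time | bucket span=1m _time | stats count by _time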
Are you in a distributed environment, and is the time correct on all your indexers? (I'm also now curious as to where in the pipeline _indextime is assigned... and whether heavy forwarders would add to the mix.)
Distributed, and all indexers use NTP.
I have manually checked the local date using a SecureCRT "send all characters immediately" to all hosts, and they all match up.
I'll try looking further in to the past. Perhaps the data hasn't properly settled. I wonder if I can tell.
Could it be the @m snapping?
When snapping to the nearest or latest time, Splunk software always snaps backwards or rounds down to the latest time not after the specified time. For example, if it is 11:59:00 and you "snap to" hours, you will snap to 11:00 not 12:00.
Could one of _index_earliest/_index_latest be snapping in the wrong direction?
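One way to check exactly where the snapped boundaries land (a standalone sketch using makeresults, so it touches no indexed data):

| makeresults | eval e=relative_time(now(), "-2m@m"), l=relative_time(now(), "-1m@m") | eval earliest_snap=strftime(e, "%F %T"), latest_snap=strftime(l, "%F %T") | table earliest_snap latest_snap

Both boundaries should snap backwards, so the window stays exactly one minute wide; if the printed boundaries ever differ from what you expect, that would explain events slipping in or out of the window.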