I need to figure out how I can gracefully revise data that's already been indexed.
My use case is this: we are monitoring counters that exist on a few hundred servers; these eventually get summed up (outside of Splunk) and tossed into a CSV file (a rough sketch of that summing step follows the example below). The CSV file has a counter delta for every 10 minutes, e.g.:
Timestamp,Datacenter,Hits,Misses
"2011-01-01 01:10:00","Singapore",3553,245
"2011-01-01 01:20:00","Singapore",5253,386
"2011-01-01 01:30:00","Singapore",1253,124
However, if a particular server is overloaded at collection time, or goes offline for a day because of a bad power supply, its data might come in late, and so the number of hits and misses for any given time slice can change. This can happen for up to a 48-hour window.
To pre-empt one suggestion: I can't use a Splunk Forwarder for this, since the summarization gets into the guts of an internal system. I'm trying to figure out the best way to deal with this. Is there a "right" way to do it?
Since it appears there is no good way to do that, I've taken the following tack:
Outside of Splunk, I have a script that parses the log files and appends only new or changed entries to the end of a logfile that Splunk monitors (a rough sketch of such a script follows the example below). My data then looks like:
Timestamp,Datacenter,Hits,Misses
"2011-01-01 01:10:00","Singapore",3553,245
"2011-01-01 01:20:00","Singapore",5253,386
"2011-01-01 01:30:00","Singapore",1253,124
"2011-01-01 01:20:00","Singapore",1449,154
I have this feed go to a separate index. This log source happens to be very small, and putting it in a different index allows one bucket to contain a lot of data. My (unverified) theory is that if it went to main, it would mess with the time ranges on the very busy buckets and create inefficiencies. That may not be true, but a separate index works for both performance and erring on the side of caution.
My searches against that index end with

| bucket _time span=10m | stats first(Hits) as Hits, first(Misses) as Misses by datacenter, _time

That way I can use my normal searches without dealing with old or duplicate data. Occasionally I clean things up by deleting the recent events (earliest=-3d@d | delete) and then re-indexing them. That will also produce some bucket-time issues, but it hasn't been a big problem so far.
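For what it's worth, that cleanup step could be scripted by submitting the delete search as an ordinary search job against Splunk's REST API on the management port. The host, credentials, and index name below are placeholders, the user running it needs the can_delete role, and keep in mind that | delete only hides events from searches rather than freeing disk space.

import requests

SPLUNK_MGMT = "https://splunk.example.com:8089"   # placeholder management host:port
AUTH = ("admin", "changeme")                      # placeholder; user must have the can_delete role

# Submit the cleanup search as a normal search job. "| delete" only marks
# events as deleted so searches no longer return them; it does not free disk.
resp = requests.post(
    SPLUNK_MGMT + "/services/search/jobs",
    auth=AUTH,
    verify=False,  # many installs use a self-signed cert on the management port
    data={"search": "search index=counter_summary earliest=-3d@d | delete"},
)
resp.raise_for_status()
print(resp.text)  # XML containing the sid of the dispatched job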
Any ideas/opinions on what the best way to do this is?