Getting Data In

How To Better Deal with Gaps in Remote Data

Splunk Employee

I need to figure out how I can gracefully revise data that's already been indexed.

My use case is this: we are monitoring counters that exist on a few hundred servers and eventually get summed up (outside of Splunk) and written to a CSV file. This CSV file has a counter delta for every 10 minutes, e.g.:

Timestamp,Datacenter,Hits,Misses
"2011-01-01 01:10:00","Singapore",3553,245
"2011-01-01 01:20:00","Singapore",5253,386
"2011-01-01 01:30:00","Singapore",1253,124

However, if a particular server is overloaded at collection time, or goes offline for a day because of a bad power supply, its data might come in late, and so the number of hits and misses for any time slice can change. This can happen for up to a 48-hour window.

To pre-empt the obvious suggestion: I can't use a Splunk forwarder for this, since the summarization happens deep in the guts of an internal system. But I'm trying to figure out what the best way to deal with this is. Ideas I've had so far:

  1. Constantly reindex the file and always start searches with stats first(Hits), first(Misses) by Datacenter, _time -- obviously an inefficient use of buckets, and ugly
  2. Constantly | delete and then reindex the file -- very inefficient use of buckets
  3. Have a "Messy" index with the raw data, based on one of the above methods, and then summarize that into a "Final" index (without actually summarizing) after the window.

Is there a "right" way to do this?
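Not from the thread, but for anyone puzzling over option 1: a minimal Python sketch of the "keep the newest value per time slice" semantics that stats first(Hits), first(Misses) is meant to achieve, assuming the most recently indexed row for a given (Datacenter, Timestamp) is the one you want:

```python
import csv
from io import StringIO

# Sample feed where the 01:20 slice was re-emitted with corrected counters.
RAW = """Timestamp,Datacenter,Hits,Misses
"2011-01-01 01:10:00","Singapore",3553,245
"2011-01-01 01:20:00","Singapore",5253,386
"2011-01-01 01:30:00","Singapore",1253,124
"2011-01-01 01:20:00","Singapore",1449,154
"""

def latest_per_slice(rows):
    # Later rows overwrite earlier ones for the same (Datacenter, Timestamp),
    # so the most recently appended correction wins for each time slice.
    result = {}
    for row in rows:
        result[(row["Datacenter"], row["Timestamp"])] = (
            int(row["Hits"]),
            int(row["Misses"]),
        )
    return result

deduped = latest_per_slice(csv.DictReader(StringIO(RAW)))
print(deduped[("Singapore", "2011-01-01 01:20:00")])  # (1449, 154)
```

This is only an illustration of the dedup logic, not how Splunk orders events internally; within one _time value, which event first() picks depends on index order.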

1 Solution

Splunk Employee

Since it appears there is no good way to do that, I've taken the following tack:

  1. Outside of Splunk, I have a script that parses the log files, and outputs only new or changed entries to the end of a logfile that Splunk monitors. My data then looks like:

    Timestamp,Datacenter,Hits,Misses
    "2011-01-01 01:10:00","Singapore",3553,245
    "2011-01-01 01:20:00","Singapore",5253,386
    "2011-01-01 01:30:00","Singapore",1253,124
    "2011-01-01 01:20:00","Singapore",1449,154

  2. I have it go to a separate index. This log source happens to be very small, and putting it in a different index allows one bucket to contain a lot of data. My (unverified) theory is that if it went to main, it would mess with the date-range timestamps on the very busy buckets and create inefficiencies. That may not be true, but a separate index helps performance and errs on the side of caution either way.

  3. All of my searches first execute | bucket _time span=10m | stats first(Hits) as Hits, first(Misses) as Misses by Datacenter, _time. That way I can use my normal searches without dealing with old or duplicate data.
  4. I have a nightly crontab that will delete my summary indexes for the last few days (earliest=-3d@d | delete) and then re-index them. That will also produce some bucket-time issues, but it hasn't been a big problem so far.
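The thread doesn't show the script from step 1, but the core of it is diffing the current pull against the last one and appending only new or changed rows to the file Splunk monitors. A hedged sketch (names like new_or_changed are mine, not the author's):

```python
import csv
from io import StringIO

def new_or_changed(previous, rows):
    # previous: dict mapping (Timestamp, Datacenter) -> (Hits, Misses)
    # from the last run. Yields rows that are new or whose counters
    # changed, and updates `previous` in place as it goes.
    for row in rows:
        key = (row["Timestamp"], row["Datacenter"])
        value = (row["Hits"], row["Misses"])
        if previous.get(key) != value:
            previous[key] = value
            yield row

seen = {}

first_pull = csv.DictReader(StringIO(
    'Timestamp,Datacenter,Hits,Misses\n'
    '"2011-01-01 01:10:00","Singapore",3553,245\n'
    '"2011-01-01 01:20:00","Singapore",5253,386\n'))
first_new = list(new_or_changed(seen, first_pull))
print(len(first_new))  # 2 -- everything is new on the first run

second_pull = csv.DictReader(StringIO(
    'Timestamp,Datacenter,Hits,Misses\n'
    '"2011-01-01 01:10:00","Singapore",3553,245\n'
    '"2011-01-01 01:20:00","Singapore",1449,154\n'))
second_new = list(new_or_changed(seen, second_pull))
print(len(second_new))  # 1 -- only the corrected 01:20 slice
```

In practice you'd serialize the yielded rows back to CSV and append them to the monitored logfile, and persist `previous` between runs; both are left out here for brevity.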




Splunk Employee

Any ideas/opinions on what the best way to do this is?
