Reporting

How to move data to a different index without creating duplicates or holes?

davidatpinger
Path Finder

I've got some data in an index that has a retention time that is intentionally short, but some of the data in that index is of higher value and I want to retain it for a longer period. I've been looking at setting up a scheduled search that uses 'collect', but I don't see a mechanism to run a scheduled search such that there's a high level of fidelity in the data - no duplicates and no holes. Since this data is more valuable we want to make sure we get it all!

Is there a simple mechanism to do such a thing? I'm thinking I want to make the base search reach far enough back in time to not miss any data that has shown up since the last run, then deduplicate against the existing data in the target index (which might be complicated without _raw) and then 'collect' whatever is left into the target.
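The approach described here might look roughly like this as a scheduled search (a sketch only: the index names and time windows are placeholders, and the NOT-subsearch dedup assumes the original _raw survives in the target index and stays within subsearch result limits):

```spl
index=source-index earliest=-15m@m latest=-5m@m
    NOT [ search index=target-index earliest=-20m@m | fields _raw ]
| collect index=target-index addtime=true
```

The lag between latest and now gives late-arriving events time to show up, and the overlap on the next run is what the dedup step has to absorb.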

0 Karma
1 Solution

davidatpinger
Path Finder

Now that I clarified the question, it occurred to me that the solution is simple: the subsearch needs to return a value for earliest.

This seems to do what I want:

[search index=target-index | head 1 | eval _time=_time + 0.001 | stats latest(_time) as earliest] latest=-1m@m index=source-index | collect index=target-index addtime=true

I had an issue where I'd get a single line of duplication every time it runs, since the event returned by the subsearch is included in the collect-ed search. Adding a bit of time seems to do the trick.


0 Karma

davidatpinger
Path Finder

This suffers a bit when an indexer restarts. Hmm. It really needs to run with _index_earliest. No idea how to pass that as of yet.

0 Karma

davidatpinger
Path Finder

The only downside here is that I was using _index_earliest, since that gives me some certainty about catching events that are delayed in reaching the indexers, for whatever reason. Since the latest event in the target index may or may not have been indexed at a time near to _time for that event, there's some slop there. Also, I can't seem to pass _index_earliest from a subsearch, although I can pass earliest. So there's some edge conditions where I might miss some events, but it should be pretty darn close. It might even be good enough.

0 Karma

davidatpinger
Path Finder

So, time is still a complicated problem. I've got a saved search that runs once a minute that does something like:

index=source-index _index_earliest=-2m@m _index_latest=-1m@m | collect index=target-index addtime=true

This works great as long as the search head that runs the search is up when the search is scheduled to run. If it's not (due to a restart, for example), there's a gap. What I really want is to be able to say something like:

earliest=[search index=target-index | head 1 | fields _time]

...but that's not valid syntax, of course. Still not really sure how to dynamically insert a time in the 'earliest' term that corresponds to the last entry in the target.
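One way to get close to that (a sketch, not tested here) is the trick of having the subsearch return a field literally named search: its value is spliced verbatim into the outer search string, which also makes it possible to inject terms like _index_earliest that can't be passed as ordinary fields:

```spl
index=source-index
    [ search index=target-index | head 1
      | stats latest(_time) as t
      | eval search="_index_earliest=" . (t + 0.001)
      | fields search ]
| collect index=target-index addtime=true
```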

0 Karma

skoelpin
SplunkTrust
SplunkTrust

If you want to filter out some "noise" in one index and keep only the important stuff in a separate index (which also increases reporting speed), a summary index is perfect for this

https://wiki.splunk.com/Community:Summary_Indexing

*Summary indexes do NOT count against your license

davidatpinger
Path Finder

So, is the standard practice to just wait a while (say, a day), and then do something like

search earliest=-2d@d latest=-1d@d | collect

That seems a bit haphazard. I guess I'm looking for more insurance that I get exactly what I want without any possibility of a data problem, and collect doesn't really do any checking - it just moves data to a summary index of your choice.

It's worth noting that collect does NOT count against your license UNLESS you change the sourcetype of your data from the default (stash, I think), in which case it DOES count against your license. I think that's a bug, but I've verified with support that this is how it presently works.
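Concretely, given the note above, the two forms behave differently for licensing (index and sourcetype names here are placeholders, with "..." standing in for your base search):

```spl
... | collect index=target-index
... | collect index=target-index sourcetype=my_custom_sourcetype
```

The first keeps the default stash sourcetype and is not metered; the second, per the behavior verified with support, counts against the license.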

0 Karma

skoelpin
SplunkTrust
SplunkTrust

Why not specify EXACTLY what you want to summarize? This way you will not miss anything

Example: say you have data from a single source, /etc/xxx/logs/12_1_2016.log, and you want to use a stats command by a certain field. If you have a lot of "noise", it may take a while to filter out the noise and return only what you're looking for.

Your populating search will look like

index=foo source="/etc/xxx/logs/12_1_2016.log" | stats count by FIELD

If you want the "insurance" of knowing you got everything, then why not just run the search first and verify you got everything as expected? Splunk is a great tool and does exactly what you ask it to do. If your query is not correct then obviously you will miss some data.

Also, you can run the populating search every 5 minutes or every day if you wanted to.

Lastly, why do you keep using the collect command? Just run a search which returns the results you expect and summarize them into a new index to increase reporting speed
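If you go the summary-indexing route, the populating search can use the si- variants of the reporting commands and let the saved search's summary-indexing setting do the writing instead of an explicit collect. A rough sketch based on the example above (the source path and FIELD are from that example):

```spl
index=foo source="/etc/xxx/logs/12_1_2016.log" | sistats count by FIELD
```

Save this as a scheduled search with summary indexing enabled and the target summary index selected.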

0 Karma