We have an index with a ton of data. A new use for the data has emerged, so now we want a longer retention time on some of the data in the index. We don't want to simply increase the retention time on the index, because the storage cost is too high. We want to create a new index with a longer retention, pick out the events we need, and copy them to the new index. This is on an indexer cluster.
In theory, we could use collect, like this:
index=oldindex field=the_events_we_need
| collect index=newindex
However, because the index is so big, we're having problems running this search. Even though we run it bit by bit, we still end up missing events in the new index - possibly due to performance or memory limits, or bucket issues.
Is there a better and more reliable way of doing this?
Thanks for your help and suggestions. We ended up using the collect method, as first presented. We could perhaps have migrated or copied buckets on disk, but since we needed specific events from the index, that wouldn't work. Also, automating the process with some sort of scripts seemed like too much work - there doesn't seem to be an easy way to do this in Splunk, and the scripts would also have to keep track of whether each search failed or succeeded, so it would be complicated to implement.
In the end, someone had to "manually" run the collect search, bit by bit, backwards in time, over the whole big index: run the collect search on a time slot, then, if successful, run it on the previous time slot, and so on.
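For reference, a single time slot of that run looked roughly like the sketch below (the field name is the placeholder from the original question, and the one-day window is only an example of a slot size):
index=oldindex field=the_events_we_need earliest=-2d@d latest=-1d@d
| collect index=newindex
Once a slot completed successfully, the window was moved back one slot and the search run again.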
Just for the sake of completeness and future reference - in Splunk 10 there is new "split index" functionality, but it only works for a very specific set of use cases. More info - https://help.splunk.com/en/splunk-enterprise/administer/manage-indexers-and-indexer-clusters/10.0/ma...
If you wanted to move around whole buckets, you could do that with no problem. But apparently you want to take part of the original data and "extract" it into another index, and that you cannot do without having Splunk search for and extract those events.
The REST-based approach where you automate searching and ingestion - either by means of | collect or by re-ingesting via HEC (remember to use the correct sourcetype so it's treated as stash data) - seems the most convenient way: you can search in small chunks so they don't overwhelm your environment.
One more thing. I'm not 100% sure how Splunk will behave around the edges of the search time ranges (like "earliest=X latest=X+100" and "earliest=X+100 latest=X+200" - what happens to events received exactly at X+100). Either do some testing or just add a failsafe like "earliest=X+99 | where _time>X+100" to avoid duplication.
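As a very rough sketch of that failsafe idea - the epoch values are placeholders and the field name is borrowed from the original question - you can pad the search window slightly and let a where clause define the exact half-open time interval for the chunk:
index=oldindex field=the_events_we_need earliest=1699999900 latest=1700003600
| where _time>=1700000000 AND _time<1700003600
| collect index=newindex
With every chunk built the same way, adjacent chunks neither overlap nor leave a gap at the boundary, whatever earliest/latest do with the edge events.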
Hi @hettervik
How much data are we talking about here - GBs or TBs?
Ultimately the best approach to take depends on the amount of data you need to extract/re-index.
The collect approach might still be viable, but it should be scripted to run in smaller increments continuously until you've extracted what you need. Alternatively, you could take a similar approach and incrementally export blocks of the data using the Splunk REST API endpoints - see https://help.splunk.com/en/splunk-enterprise/search/search-manual/9.3/export-search-results/export-d... for more info - and then re-ingest the exported data using a UF/HF.
Hello @hettervik,
From the scenario, it seems that collect is the only way to achieve your use case. You'll have to filter out the events you don't need and optimize the SPL search so that the collect command doesn't miss any of the required events.
However, if you want to migrate the buckets, I've found an older community post that might help you - https://community.splunk.com/t5/Installation/Is-it-possible-to-migrate-indexed-buckets-to-a-differen.... I would be quite cautious with this approach, though - I haven't tried it myself, and copying the buckets might bring unwanted data into the new index. You could try it with one of the smaller buckets first and check whether you get the desired result.
IMO, collect is the best way to move forward. You can use the following SPL query to keep the original parsing configuration (the search filter below uses the placeholder field from your question - replace it with whatever identifies the events you need):
index=old_index
| search field=the_events_we_need
| fields host source sourcetype _time _raw
| collect index=new_index output_format=hec
Thanks,
Tejas.
---
If the above solution helps, an upvote is appreciated!
There was a similar discussion on the Slack side some time ago. Maybe this can lead you in the right direction? https://splunkcommunity.slack.com/archives/CD9CL5WJ3/p1727111432487429