We have an index with a ton of data. A new use for the data has emerged, so now we want a longer retention time on some of the data in the index. We don't want to simply increase the retention time on the index, because the storage cost is too high. We want to create a new index with a longer retention, pick out the events we need, and copy them to the new index. This is on an indexer cluster.
In theory, we could use collect, like this:
index=oldindex field=the_events_we_need
| collect index=newindex
However, because the index is so big, we're having problems running this search. Even though we run it bit by bit, we still end up missing events in the new index, possibly due to performance or memory limits, or bucket issues.
Is there a better and more reliable way of doing this?
If you wanted to move whole buckets around, you could do that with no problem. But apparently you want to pull a subset of the original data and "extract" that part into another index, and that cannot be done without having Splunk search the data and extract those events.
The REST-based approach, where you automate the searching and re-ingestion (either by means of | collect or by re-ingesting via HEC, remembering to use the correct sourcetype so the data is treated as stash), seems the most convenient way. That way you can search in small chunks so the jobs don't overwhelm your environment.
One more thing: I'm not 100% sure how Splunk will behave around the edges of the search time ranges (for example, "earliest=X latest=X+100" followed by "earliest=X+100 latest=X+200" - what happens to events received at exactly X+100?). Either do some testing or add a failsafe like "earliest=X+99 | where _time>X+100" to avoid duplications.
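If you do script this, a minimal Python sketch of the chunked collect approach could look like the following. It is only an illustration of the idea: the host, credentials, index names, time range and window size are all placeholders to adjust for your environment, and it simply submits the search to the standard /services/search/jobs REST endpoint one window at a time.

#!/usr/bin/env python3
# Sketch only: run "| collect" over small, consecutive time windows via the
# Splunk REST API. All names, credentials and times below are placeholders.
import time
import requests

SPLUNK = "https://your-search-head:8089"   # management port of a search head
AUTH = ("admin", "changeme")               # prefer a token or stored credentials
WINDOW = 3600                              # one-hour chunks, in epoch seconds
START = 1735689600                         # earliest epoch time to copy
END = 1735776000                           # latest epoch time to copy

SEARCH = (
    "search index=oldindex field=the_events_we_need "
    "earliest={e} latest={l} | collect index=newindex"
)

t = START
while t < END:
    chunk_end = min(t + WINDOW, END)
    resp = requests.post(
        f"{SPLUNK}/services/search/jobs",
        data={
            "search": SEARCH.format(e=t, l=chunk_end),
            "exec_mode": "blocking",   # wait for each chunk to finish before starting the next
            "output_mode": "json",
        },
        auth=AUTH,
        verify=False,                  # or point requests at your CA bundle
    )
    resp.raise_for_status()
    print(f"window {t}-{chunk_end}: job {resp.json().get('sid')} finished")
    t = chunk_end
    time.sleep(1)                      # be gentle with the search head

Splunk's time range semantics are generally earliest-inclusive and latest-exclusive, so back-to-back windows like this should not double-count boundary events, but verifying that on a small range first (or adding the failsafe above) is still a good idea.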
Hi @hettervik
How much data are we talking here? Is this GB/TB?
Ultimately the best approach to take depends on the amount of data you need to extract/re-index.
The collect approach might still be viable, but it should be scripted to run in smaller increments continuously until you've extracted what you need. Alternatively, you could take a similar approach and incrementally export blocks of the data using the Splunk REST API endpoints - see https://help.splunk.com/en/splunk-enterprise/search/search-manual/9.3/export-search-results/export-d... for more info - and then re-ingest the exports using a UF/HF.
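For the export route, a sketch along these lines (all hosts, credentials and paths are placeholders) streams one time window from the /services/search/jobs/export endpoint to disk, where a UF/HF monitor input can pick the file up and re-ingest it into the new index. Repeat it per window, as with the collect approach.

#!/usr/bin/env python3
# Sketch only: export the raw events for one time window to a file.
# Re-ingest the resulting files with a UF/HF monitor input afterwards.
import requests

SPLUNK = "https://your-search-head:8089"
AUTH = ("admin", "changeme")
QUERY = ("search index=oldindex field=the_events_we_need "
         "earliest=-30d@d latest=-29d@d")

with requests.post(
    f"{SPLUNK}/services/search/jobs/export",
    data={"search": QUERY, "output_mode": "raw"},  # raw keeps the original _raw text
    auth=AUTH,
    verify=False,
    stream=True,                                   # stream to disk instead of buffering the whole chunk
) as resp:
    resp.raise_for_status()
    with open("/opt/export/oldindex_chunk_001.raw", "wb") as out:
        for block in resp.iter_content(chunk_size=1 << 20):
            out.write(block)

Note that the re-ingested files are parsed again on the way in, so configure the sourcetype on the monitor input deliberately (the stash/sourcetype and licensing point in the first answer applies here as well).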
Hello @hettervik,
From the scenario, it seems that collect is the only way to achieve your use case. Try filtering out the events you don't need and optimizing the SPL search before the collect command so that you don't miss any of the required events.
However, if you want to migrate whole buckets, I've found an older community post that might help you - https://community.splunk.com/t5/Installation/Is-it-possible-to-migrate-indexed-buckets-to-a-differen.... I would be quite cautious with that approach, though - I haven't tried it myself, and copying buckets might bring unwanted data into the new index. Try it with one of the smaller buckets first to see whether you get the desired result.
IMO, collect is the best way to move forward. You can use the following SPL query to keep the original parsing configuration:
index=old_index
| <<filter out the events required>>
| fields host source sourcetype _time _raw
| collect index=new_index output_format=hec
Thanks,
Tejas.
There was a similar discussion on the Slack side some time ago. Maybe it can point you in the right direction? https://splunkcommunity.slack.com/archives/CD9CL5WJ3/p1727111432487429