Is there a query to identify underused fields?
We are optimizing the size of our large indexes. We identified duplicates and noisy logs, but next we want to find fields that aren't commonly used and get rid of them (or, if you have any additional advice on cleaning out a large index, that would help too).
Is there a query for this?
Understood. Would you happen to have any advice on cleaning a big index?
Really the only way to "clean" an index is for the data to age out. Running "| delete" on an index will stop it appearing in searches; however, the data will still be present on disk, just with markers that stop it being returned, so it won't actually give you any space back if that is what you are looking for.
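For completeness, a delete looks like the below (the index and sourcetype names here are placeholders, and it must be run by a role with the can_delete capability). Again, it only masks events from search results; it does not reclaim disk space:

    index=my_index sourcetype=noisy_logs earliest=-90d
    | delete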
The best thing you can do is control the data arriving in the platform and reduce it as necessary; hopefully, over time, the older/larger/wasteful data will age out and free up space.
What is your retention on this index (or indexes)? If it's something like 90 days then you won't have too long to wait, but if it's 6 years then you might be stuck with that old data for some time!
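If you want to check, a quick btool query on an indexer (the index name is a placeholder) shows the effective retention in seconds:

    $SPLUNK_HOME/bin/splunk btool indexes list my_index | grep frozenTimePeriodInSecs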
It should be stated up-front that indexes cannot be reduced in size. You must wait for buckets to be frozen for data to go away. The best you can do is reduce how much is stored in new buckets.
You've already taken a good first step by eliminating duplicate events.
Next, look at indexed fields. Fields are best extracted at search-time rather than at index-time. Doing so helps indexer performance, saves space in the indexes, and offers more flexibility with fields.
Look at the INDEXED_EXTRACTIONS settings in your props.conf files. Each of them creates index-time fields. JSON data is especially verbose, so KV_MODE=json should be used instead.
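As a sketch (the sourcetype name is a placeholder), the change in props.conf would look something like this:

    [my_json_sourcetype]
    # Remove/disable the index-time extraction:
    # INDEXED_EXTRACTIONS = json
    # Extract the JSON fields at search time instead:
    KV_MODE = json

Keep in mind that INDEXED_EXTRACTIONS is applied where the data is first parsed (for structured data this is often the universal forwarder), while KV_MODE takes effect on the search head, so the two settings may live on different instances.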
Yeah, we make adjustments with new indexes; however, the large indexes were created before I got hired, so I'm actively trying to reduce ingest on what's already flowing. Great advice, btw.
The best option is to define your use cases and, based on those, remove unused values before indexing events to disk. The trade-off is that when you later realize a new use case, you must update your ingest definitions to get the new values into Splunk.
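One way to do this (a sketch only; all names here are placeholders) is an ingest-time eval in transforms.conf that strips an unused key-value pair out of _raw before it is written to disk:

    # transforms.conf
    [trim_unused_values]
    INGEST_EVAL = _raw := replace(_raw, "verbose_field=\S+\s*", "")

    # props.conf
    [my_sourcetype]
    TRANSFORMS-trim = trim_unused_values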
One thing you could check is whether those events contain the same information twice or even more times. This can happen when an event carries a code and the same information is then added again as clear text. A good example is Windows event logs, where this happens.
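For the Windows case specifically, a common community approach (a sketch only; the exact sourcetype and regex depend on your data) is a SEDCMD in props.conf that strips the repeated descriptive text before it is indexed:

    [WinEventLog:Security]
    # Drop the boilerplate explanation that repeats what the event code already says
    SEDCMD-strip_winlog_text = s/This event is generated[\s\S]+$//g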
There are also some other things along those lines that you could do.
Unfortunately, this is not something that is really possible.
I have seen some attempts at this previously; however, it is very easy to miss things, as specific fields are not always referenced by name but could still be used, for example through wildcards (such as | table *), macros, saved searches, or dashboards where the field name never appears literally in the search string.
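If you still want a rough starting point, one sketch (the index name and threshold are placeholders) is fieldsummary, which shows how often each field actually appears in events. Bear in mind this surfaces fields that rarely occur, which is not the same thing as fields that are rarely searched:

    index=my_index earliest=-7d
    | fieldsummary
    | where count < 100
    | table field count distinct_count
    | sort count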