Knowledge Management

summarizing _raw data in an index to reduce index size

sonicZ
Contributor

Our company has been gathering auditd logs since last summer, and now our Splunk infrastructure is getting very fat on the indexed auditd data. I can't delete this data either, since we require it for audits.
The solution I was coming up with was to start summarizing the _raw data, based on some other examples I've seen:

index=audit | dedup _raw | rename _raw as orig_raw

Then I would verify the summarized results against the indexed results and expire data off colddb sooner than we do now.
Is there a better solution out there? The main goal is to reduce index disk usage.
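
For reference, the kind of summarizing search I had in mind would probably need to end with collect so the results land in a separate, cheaper summary index - something like this (audit_summary is just a placeholder name):

index=audit | dedup _raw | rename _raw as orig_raw | collect index=audit_summary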

1 Solution

lguinn2
Legend

Splunk is already compressing the raw data. If your main goal is to reduce disk usage, then my first question is: must the data always be searchable? Or is it simply a requirement that the data must be retrievable if needed?

If you specify a cold-to-frozen directory and a shorter lifetime, Splunk will move "expired" buckets into the frozen directory. In the frozen directory, the buckets will be approximately 30% of their former size - because most of the index info is stripped away. Most folks then store the frozen buckets offline, but you don't have to.

However, frozen buckets are not searchable; you have to rebuild a bucket to use its contents. But if the data is very rarely searched and really just kept for compliance, this could be a good solution.
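
To give a rough idea of the thaw-and-rebuild step, you copy the frozen bucket into the index's thaweddb directory and then run splunk rebuild on it. The paths and bucket name below are only illustrative; the thaweddb location depends on how your index is defined:

cp -r /archive/frozen/audit/db_1388534400_1385856000_42 $SPLUNK_HOME/var/lib/splunk/audit/thaweddb/
$SPLUNK_HOME/bin/splunk rebuild $SPLUNK_HOME/var/lib/splunk/audit/thaweddb/db_1388534400_1385856000_42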

I don't think that dedup is going to help you unless you truly have exact duplicates of a lot of your data.


lguinn2
Legend

You shouldn't need the coldToFrozenScript. Just make sure that the "Frozen archive path" is set to a real directory. Splunk will automatically strip off everything it can when it puts the compressed data into that directory.

In indexes.conf the frozen archive path is set like this:

coldToFrozenDir = <path to frozen archive>

Note that the path cannot contain a volume reference.
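
For example, a stanza along these lines would do it - the index name, retention period, and archive path are only illustrative:

[audit]
homePath = $SPLUNK_DB/audit/db
coldPath = $SPLUNK_DB/audit/colddb
thawedPath = $SPLUNK_DB/audit/thaweddb
# roll buckets to frozen after ~90 days instead of the 6-year default
frozenTimePeriodInSecs = 7776000
# archive frozen buckets here instead of deleting them; no volume: reference allowed
coldToFrozenDir = /archive/splunk/audit_frozen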


sonicZ
Contributor

The data does not need to be searchable; retrievable upon request would work for us.
I've always used cold-to-frozen as our delete mechanism, so I suppose I'll have to use the coldToFrozenScript.

Is the default $SPLUNK_HOME/bin/coldToFrozenExample.py the script that will convert buckets to 30% of their normal size?
