Knowledge Management

summarizing _raw data in an index to reduce index size

sonicZ
Contributor

Our company has been gathering auditd logs since last summer, and now our Splunk infrastructure is getting very fat on the indexed auditd data. I can't delete this data either, since we require it for audits.
The solution I was coming up with was to start summarizing the _raw data, based on some other examples I've seen:

index=audit | dedup _raw | rename _raw as orig_raw

Then I would verify the summarized results against the indexed results and expire data off colddb sooner than we do now.
Is there a better solution out there? The main goal is to reduce index disk usage.
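
For reference, the kind of summarizing search I had in mind would probably need to end with collect so the results land in a separate, cheaper summary index - something like this (audit_summary is just a placeholder name):

index=audit | dedup _raw | rename _raw as orig_raw | collect index=audit_summary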

1 Solution

lguinn2
Legend

Splunk is already compressing the raw data. If your main goal is to reduce disk usage, then my first question is: must the data always be searchable? Or is it simply a requirement that the data must be retrievable if needed?

If you specify a cold-to-frozen directory and a shorter lifetime, Splunk will move "expired" buckets into the frozen directory. In the frozen directory, the buckets will be approximately 30% of their former size - because most of the index info is stripped away. Most folks then store the frozen buckets offline, but you don't have to.

However, frozen buckets are not searchable; you have to rebuild a bucket to use its contents. But if the data is very rarely searched and really just kept for compliance, this could be a good solution.
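
To give a rough idea of the thaw-and-rebuild step, you copy the frozen bucket into the index's thaweddb directory and then run splunk rebuild on it. The paths and bucket name below are only illustrative; the thaweddb location depends on how your index is defined:

cp -r /archive/frozen/audit/db_1388534400_1385856000_42 $SPLUNK_HOME/var/lib/splunk/audit/thaweddb/
$SPLUNK_HOME/bin/splunk rebuild $SPLUNK_HOME/var/lib/splunk/audit/thaweddb/db_1388534400_1385856000_42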

I don't think that dedup is going to help you unless you truly have exact duplicates of a lot of your data.


lguinn2
Legend

You shouldn't need the coldToFrozenScript. Just make sure that the "Frozen archive path" is set to a real directory. Splunk will automatically strip off everything it can when it puts the compressed data into that directory.

In indexes.conf the frozen archive path is set like this:

coldToFrozenDir = <path to frozen archive>

Note that the path cannot contain a volume reference.
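
For example, a stanza along these lines would do it - the index name, retention period, and archive path are only illustrative:

[audit]
homePath = $SPLUNK_DB/audit/db
coldPath = $SPLUNK_DB/audit/colddb
thawedPath = $SPLUNK_DB/audit/thaweddb
# roll buckets to frozen after ~90 days instead of the 6-year default
frozenTimePeriodInSecs = 7776000
# archive frozen buckets here instead of deleting them; no volume: reference allowed
coldToFrozenDir = /archive/splunk/audit_frozen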


sonicZ
Contributor

The data does not need to be searchable; retrievable upon request would work for us.
I've always used cold-to-frozen as our delete mechanism, so I suppose I'll have to use the coldToFrozenScript.

Is the default $SPLUNK_HOME/bin/coldToFrozenExample.py the script that will convert buckets to 30% of their normal size?
