
Splunk Frozen Bucket Retention Policy

Mohamamd_Mir

Splunk has no built-in mechanism for managing the frozen bucket path. I wrote a script for which you provide a config file specifying a volume or time limit for the frozen path of each index; if a limit is exceeded, the oldest data is deleted first.

The script also provides detailed logs of the deletion process, including how much data and time remains in the Frozen Path for each index and how long the deletion process took. The entire script runs as a service and executes once every 24 hours.

I’ve explained the implementation details and all necessary information in the link below.

 

Mohammad-Mirasadollahi/Splunk-Frozen-Retention-Policy: This repository provides a set of Bash script...

 


Mohamamd_Mir

@PickleRick 

Thanks for your kind reply. 

I'll try to fix the points you mentioned.

If you think there's anything else that needs fixing, just let me know. I'd appreciate any feedback.


PickleRick

OK. Mind you, these are not strictly Splunk-related points; they're more my personal outlook based on 25 years of admin experience.

1. For me, you're doing too many things in one script. I understand your approach, but I prefer the KISS principle when writing shell scripts; for more complex things I'd go with Python. That's my personal taste, though. I don't like overly complicated bash scripts because they tend to get messy quickly.

To be honest, I would do it completely the other way around: a single simple script that manages the frozen storage of one index with given parameters (size/time), and possibly another "managing" script that spawns that single script for each index independently.
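The "managing" part could then be as small as a loop over per-index config files calling the single-index script; all the names here are purely illustrative:

for idx_conf in /etc/frozen-retention/indexes.d/*.conf; do
    ./frozen_cleanup_single.sh "$idx_conf" &    # one cleanup process per index
done
wait    # let all per-index cleanups finish before exiting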

2. I don't see the point in generating a random PROCESS_ID, and even less so in pulling in openssl as an external dependency just to generate the value of this variable.
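If all you need is an identifier to correlate the log lines of a single run, bash already has one built in; a minimal sketch (the variable names are just examples):

PROCESS_ID=$$                # PID of the running script, no external tools needed
RUN_ID="$(date +%s)-$$"      # or a timestamp-plus-PID pair, unique enough per run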

3. You are hardcoding many paths - LOG_FILE, CONFIG_FILE, FROZEN_PATH... That might be OK if you're only writing a one-off script for internal use, but for a portable solution it's much more user-friendly to make them configurable. The easiest way is to externalize those definitions into a separate file and include it with the dot (source) command. Bonus: you can use the same config file in both scripts, whereas presently you have to configure each script separately.
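Roughly like this, with all paths being placeholders:

# retention.env - shared settings, sourced by both scripts
FROZEN_PATH=/srv/splunk/frozen
CONFIG_FILE=/etc/frozen-retention/indexes.conf
LOG_FILE=/var/log/frozen-retention.log

and then at the top of each script:

. /etc/frozen-retention/retention.env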

4. Having one script chmod another so it can run it... that's not nice. Setting the execute bit belongs in the installation instructions.

5. I don't like the idea of a script that creates the service file. Just provide a service file template along with instructions to customize it if needed. (I would probably do it with cron instead of a service, but that's me - I'm old.)
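If you go the cron route, a single crontab entry is enough; the schedule, user and paths below are only an example:

# /etc/cron.d/frozen-retention - run the cleanup once a day as the splunk user
0 3 * * * splunk /opt/frozen-retention/frozen_cleanup.sh >> /var/log/frozen-retention.log 2>&1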

6. IMHO a script manipulating relatively sensitive data like this should use a lock file to prevent it from running multiple times in parallel.
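The usual flock pattern (flock ships with util-linux) covers this in two lines; the lock file path is just an example:

exec 200>/var/lock/frozen-retention.lock
flock -n 200 || { echo "Another run is already in progress, exiting." >&2; exit 1; }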

7. The mechanics of deleting frozen buckets are highly suboptimal. You're spawning several find and du invocations after removing each file, which is a lot of unnecessary disk scanning. Also, why remove the files from the bucket directory one by one and only afterwards remove the empty directory?

8. To make the script consistent with how Splunk handles buckets, you should not use ctime or mtime but rather take the timestamps from the bucket boundaries. (They might yield the same order, since buckets will probably be frozen in the same order they should roll out of frozen, but - especially if you're using shared frozen storage across multiple cluster nodes and deduplicate buckets - that's not guaranteed.)
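Those boundaries are encoded in the bucket directory name itself - frozen buckets typically keep the db_<newestTime>_<oldestTime>_<id> naming - so extracting them is simple (the example name below is made up):

bucket=db_1700000000_1690000000_42          # db_<newest epoch>_<oldest epoch>_<bucket id>
newest=$(echo "$bucket" | cut -d_ -f2)
oldest=$(echo "$bucket" | cut -d_ -f3)
echo "bucket covers $(date -d @"$oldest") .. $(date -d @"$newest")"    # GNU date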

9. Sorry to say it, but it shows that this was written with ChatGPT - there are some design choices that are inconsistent (like the timestamp manipulation, and doing arithmetic sometimes with built-in bash functionality and other times by spawning bc).

So again - I do appreciate the effort. It's just that I would either do it completely differently (which might be simply my personal taste) or - if it was to be a quick and dirty hack - I would simply use tmpreaper (if your distro provides it) or do

find /frozen_path -ctime +X -delete

(where X is the age in days; yes, it doesn't account for size limits, but it's quick and reliable)

If you want to use size limits, just list directory sizes, sort by date, sum them up until you hit the limit, delete the rest. Et voila. Honestly, don't overthink it.
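As a very rough sketch of that idea (FROZEN_PATH and the byte limit are placeholders; it assumes one bucket per directory, no whitespace in names, and GNU du):

FROZEN_PATH=/frozen_path
MAX_BYTES=$((10 * 1024 * 1024 * 1024))      # e.g. a 10 GiB budget for this index

total=0
# Newest directories first; keep them until the budget is spent, delete the rest.
ls -dt "$FROZEN_PATH"/*/ | while read -r dir; do
    size=$(du -sb "$dir" | cut -f1)         # bucket size in bytes
    total=$((total + size))
    if [ "$total" -gt "$MAX_BYTES" ]; then
        rm -rf "$dir"
    fi
done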

Mohamamd_Mir

Thanks so much for your attention. Your feedback really means a lot to me.

I totally agree that there are different ways to reach the same goal. I’ll definitely try to use your suggestions, but honestly, if I were to implement everything you mentioned, it would pretty much turn into a whole new project with a different approach. Using Python was a great idea, but for some reason, I just didn’t end up using it! 😄

Let me explain a bit about some of the points you brought up.

The main thing that made the code a bit complicated is all the logging that’s happening. I needed to log every single event in the project, and the reason I used process IDs was to track everything from start to finish. Since the code is open source, anyone can tweak it to fit their needs.

The task might seem simple (deleting frozen buckets based on a limit), but as you know, once you start working on a project, you run into all sorts of issues. Writing this took me a few weeks, and without ChatGPT, it would’ve taken even longer. I’ve mentioned in the Readme that I got some help from ChatGPT.

As for the hardcoded paths, your idea is a good one, and I'm hoping someone will contribute that change to the project.

Lastly, I tested this script on 40TB of frozen data with a daily log volume of 5TB, and at least for me, there weren’t any performance issues. Deleting directly (from shell) was just as fast as using the script.

I hope you get a chance to test it out and let me know how it goes. I’d be really happy to use your feedback to improve the project even more.


PickleRick

Hi Mohammad.

I wanted to take a look at your scripts, but it seems you're not providing them as source on GitHub. You've uploaded a tgz archive, which makes it impossible to browse the source, submit pull requests, or see diffs on commits. That's not how you host a project on Git.

You also provide an inputs.conf file which references some unknown sourcetype. That's not how it's normally done. Normally you build an add-on containing inputs, props, transforms (if needed) and metadata, with the inputs disabled by default so that the user can enable them if needed.
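A minimal sketch of such a default-disabled input, assuming the script logs to a file (the path and sourcetype name are placeholders):

# default/inputs.conf inside the add-on
[monitor:///var/log/frozen-retention.log]
sourcetype = frozen:retention
disabled = 1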

Also, instructing people to run a script as root by default is a big no-no for me.

I don't mean to discourage you, since I'm sure you put some real effort into this. It's just something worth considering if you want to make your "product" better.
