Some background:
So we are having some problems in our environment: we have a cluster of indexers, and some of the servers are close to getting full while others are at 50-60% free disk space (keep in mind they all started with the same amount of disk space).
Hot/warm is kept locally (and that is where the disk space problem resides) and cold is moved over to a SAN.
I understand that there could be an issue with replication and that a couple of servers could end up with more disk space occupied.
Most of the indexes have "homePath.maxDataSizeMB" set, but some are just rolled after a certain amount of time. I wasn't the one who set this up and I don't usually manage it, so I don't know the reasoning behind how and why everything was configured the way it is.
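To illustrate what I mean, the difference between the two setups is roughly something like this in indexes.conf (index names and values here are made up, just for the sake of the example):

[index_capped_by_size]
homePath.maxDataSizeMB = 100000

[index_capped_by_age]
frozenTimePeriodInSecs = 7776000

As far as I understand it, the first one caps how much hot/warm data that index can keep locally, while the second one just ages buckets out after ~90 days regardless of how big they get.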
My question is:
Is there any way of balancing the indexes between indexers besides setting strict values or measuring indexed data per day vs. hot/warm disk space available?
The problem with that approach is that we have a lot of indexes (40+), and I don't really have the knowledge of the environment, or the time, to make a judgement call on how to tune each one.
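To illustrate the "measuring" part, the closest thing I can think of is summing bucket sizes per indexer with something like this (a rough sketch; I'm assuming the sizeOnDiskMB, state and splunk_server fields come back from dbinspect when run from the search head):

| dbinspect index=* | search state=hot OR state=warm | stats sum(sizeOnDiskMB) AS hot_warm_mb by splunk_server, index

But that still leaves me tuning 40+ indexes by hand, which is what I'd like to avoid.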
Sorry for the late reply here, but we solved this a while ago. There was a lot of data left behind by a backup script during a hardware migration, sitting in the same path as all the hot/warm buckets in Splunk, and it took up ~30% of the local disk space.
We deleted it and everything is up and running fine now.
🙂 Would you mind accepting your answer? There should be a button that says Accept Answer below my comment somewhere
As alacercogitatus mentioned, you'll want to look at how the heavy forwarders are connecting to the indexers (every forwarder has every indexer in outputs.conf, there aren't any network connectivity issues, etc.), but one other thing worth mentioning is that you might have a large number of excess buckets on the indexers that are near capacity. These accumulate over time, especially if you are performing maintenance on the indexers, and depending on your replication/search factors. They can be removed periodically to free up space.
See: http://docs.splunk.com/Documentation/Splunk/6.2.3/Indexer/Removeextrabucketcopies
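If memory serves, on an indexer cluster you run this from the cluster master (the index name is just a placeholder; leave it off to cover all indexes):

splunk remove excess-buckets <index_name>

I believe there's also a button for this on the master's bucket status dashboard.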
I've already tried removing excess buckets a few times a week, but there's no major change in free disk space. I'm wondering if there is data on these machines that isn't recognized by Splunk anymore. Is there any way of finding this out?
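One thing I'm considering is comparing the space Splunk itself accounts for against what the OS reports for the partition, roughly like this (a sketch; I'm assuming the partitions-space REST endpoint reports capacity and free in MB):

| rest /services/server/status/partitions-space splunk_server=* | eval used_mb=capacity-free | table splunk_server mount_point used_mb free

and then comparing that per server against:

| dbinspect index=* | stats sum(sizeOnDiskMB) AS splunk_mb by splunk_server

If the gap between used_mb and splunk_mb is much bigger on the problem indexer, that would point at data Splunk doesn't know about.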
A quick item to check is your forwarders. Make sure that the forwarders are load balancing correctly, and have the entire set of indexers configured as outputs.
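As a rough example, each forwarder's outputs.conf should look something like this (host names and port are placeholders):

[tcpout]
defaultGroup = my_indexers

[tcpout:my_indexers]
server = idx1.example.com:9997, idx2.example.com:9997, idx3.example.com:9997
autoLBFrequency = 30

With all indexers listed in one tcpout group, the forwarder rotates through them; autoLBFrequency controls how often it switches targets (30 seconds is the default).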
How many Heavy Forwarders are balancing their data onto how many Indexers? Too few HFs can cause "rolling denial of service" attacks against your own Indexers...
@alacer's search might be quicker like this:
| tstats count where index=* by splunk_server index sourcetype
Thanks @martin_mueller. I keep forgetting about tstats.
I did the search you mentioned but couldn't find the server with the disk problem, so I did another search on that server only, like:
| tstats count where index=* splunk_server="$host*" by index sourcetype
When doing this I only got a small number of results compared to the other servers: ~25,000 compared to 147 million on one of the servers that has no disk issues.
There could be something missing, as I'm not really aware of all the intricacies of Splunk.
We are sending all data through heavy forwarders, and it is then load balanced across the indexer cluster.
Interesting. Why the Heavy Forwarders? Can you verify that the events are being spread correctly? You might be able to tell if there is a wayward input somewhere.
| metasearch index=* | stats count by splunk_server sourcetype
This will give you a better picture of which indexers are receiving which sourcetypes, and if the counts aren't even, you can probably find the wayward input.