We are currently running out of space on one Splunk indexer out of the 5 indexers in our distributed environment, running Splunk version 6.2.1.
The total size of the indexer volume is about 5.2 TB. We are currently left with less than 100 GB of free space, and on average another 10 GB is consumed every day. The data occupying the space is up to 3.5 years old, and most of it sits under the colddb storage directories under the mount point /splogs.
Disk Usage status
df -h /splogs
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_splunk03_san-splunk_logs
5.6T 5.3T 93G 99% /splogs
We could find most of the space is occupied by these indexes.
[net_proxy], [net_fw], [unix_svrs] & [unix_bsm]
Example:
[root@splunk03 splogs]# cd unix_svrs
[root@splunk03 unix_svrs]# ls -ltr
total 416
drwx------ 2 splunk splunk 4096 Apr 19 2012 thaweddb
drwx------ 1590 splunk splunk 102400 Aug 6 09:18 colddb
drwx------ 1890 splunk splunk 131072 Aug 6 12:51 summary
drwx------ 1893 splunk splunk 143360 Aug 6 12:53 datamodel_summary
drwx------ 307 splunk splunk 28672 Aug 6 12:54 db
[root@splunk03 unix_svrs]# du -sh *
1007G colddb
1.6G datamodel_summary
229G db
366M summary
4.0K thaweddb
[root@splunk03 splogs]# cd net_fw
[root@splunk03 net_fw]# ls -ltr
total 612
drwx------ 2 splunk splunk 4096 Apr 19 2012 thaweddb
drwx------ 1358 splunk splunk 131072 Sep 27 2015 summary
drwx------ 2956 splunk splunk 180224 Aug 6 12:17 colddb
drwx------ 3258 splunk splunk 266240 Aug 6 12:55 datamodel_summary
drwx------ 313 splunk splunk 28672 Aug 6 12:55 db
[root@splunk03 net_fw]# du -sh *
1.3T colddb
76G datamodel_summary
147G db
24M summary
4.0K thaweddb
indexes.conf details for these indexes:
[volume:Hot]
path = /splogs
[volume:Cold]
path = /splogs
[volume:Base]
path = /splogs
[default]
frozenTimePeriodInSecs = 31536000
[net_fw]
homePath = volume:Hot/net_fw/db
coldPath = volume:Cold/net_fw/colddb
tstatsHomePath = volume:Hot/net_fw/datamodel_summary
thawedPath = $SPLUNK_DB/net_fw/thaweddb
maxTotalDataSizeMB = 250000
[unix_svrs]
homePath = volume:Hot/unix_svrs/db
coldPath = volume:Cold/unix_svrs/colddb
tstatsHomePath = volume:Hot/unix_svrs/datamodel_summary
thawedPath = $SPLUNK_DB/unix_svrs/thaweddb
maxTotalDataSizeMB = 250000
[summary]
frozenTimePeriodInSecs = 188697600
The other indexes are configured in the same manner in indexes.conf.
Kindly let me know whether we can delete the data present under the colddb directories for the indexes occupying more than 1 TB. What would the impact of doing this be? Or is there some other way to prevent the Splunk service from failing due to low disk space?
What do you have your bucket size set to in Splunk?
To answer your question: you could change the retention policy so that cold data rolls to frozen, which deletes it by default, or you could manually delete data from the cold buckets with no impact.
It looks like you have frozenTimePeriodInSecs = 31536000, which means a bucket will be frozen/deleted once its newest event is older than 1 year. You may want to consider reducing this number if you have a high volume of events coming in, or you could grow the disk.
http://docs.splunk.com/Documentation/Splunk/6.4.2/Indexer/Setaretirementandarchivingpolicy
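If shorter retention is acceptable, that's a one-line change. A minimal sketch, assuming a six-month policy is what you settle on (15768000 is just an illustrative value; pick whatever your retention requirements actually allow):

```ini
# indexes.conf (sketch only) -- 15768000 s is roughly 6 months
[default]
frozenTimePeriodInSecs = 15768000
```

Note that freezing removes whole buckets, so a bucket only becomes eligible once its newest event passes the threshold.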
thanks skoelpin,
There are two frozenTimePeriodInSecs settings: 31536000 under the [default] stanza and 188697600 under the [summary] stanza, as shown in the indexes.conf above. Which one is taken into consideration when deleting the data?
The one under the [default] stanza applies to every index that doesn't override it; the [summary] stanza only overrides it for the summary index. So you would want to change the one under your [default] stanza.
You can delete data in your colddb by doing an rm -rf, though I'm not sure that's the officially proper way of doing it. I've personally done it 3 times already when we were in crisis mode, and there were zero negative side effects from doing it this way.
So to answer your question: yes, you technically could rm -rf data in the colddb to clear up some room.
thanks Skoelpin for sharing your experience. I might end up doing the same after getting the required permission, as this is a prod environment.
Whether you can remove old buckets or not depends on whether you need the data in those buckets or not - we can't help you there.
That being said, taking a look at your config I have a few pointers.
Are your indexers sharing that 5.2 TB? If so, are all five indexers writing into the same path? That's looking for trouble.
Doing the maths suggests this is the case: each indexer is configured to consume up to roughly 250 GB for each of those indexes. Multiplied by five, that's about 1.25 TB per index, and both indexes are currently at about that size.
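That capacity arithmetic can be checked quickly. A minimal shell sketch, assuming all five indexers do share the volume and each applies the maxTotalDataSizeMB = 250000 cap independently:

```shell
# Each indexer enforces maxTotalDataSizeMB per index on its own; on a shared
# volume the caps add up across indexers.
per_indexer_mb=250000
indexers=5
total_mb=$((per_indexer_mb * indexers))
total_gib=$((total_mb / 1024))
echo "${total_mb} MB total per index, ~${total_gib} GiB"
# prints: 1250000 MB total per index, ~1220 GiB
```

About 1.2 TiB per index, which matches the ~1.3 TB seen on disk for net_fw.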
You should see old buckets being removed all the time: search index=_internal component=bucketmover idx=unix_svrs (or the other indexes). If you're at the maximum configured space, Splunk will throw out the oldest buckets on its own and the size should not grow further.
If you need more space for other indexes AND have established that you can throw out more old data, you could reduce maxTotalDataSizeMB on the indexers a bit; then they'll throw out more old buckets on their own. Just deleting buckets while Splunk is using them is, again, looking for trouble.
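For instance, a hedged sketch of that change (200000 is an illustrative value only; with five indexers it would target roughly 1 TB per index in total, so size it to your actual needs):

```ini
# indexes.conf (sketch only) -- apply the same value on every indexer
[net_fw]
maxTotalDataSizeMB = 200000

[unix_svrs]
maxTotalDataSizeMB = 200000
```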
Another point: you've configured a year of data retention. Do check whether your disk is large enough to make it to one year, assuming that year is based on compliance ("must store a year") rather than privacy ("cannot store more than a year").
"Expert"... hummmmm.
I see two approaches: remove two of the volume definitions and map all indexes that used them onto the remaining volume at the same path, or move the volumes' paths to different locations.
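A sketch of the second approach, with hypothetical mount points (/splogs_hot and /splogs_cold are made-up names; you'd need real, separate storage and a carefully planned data move):

```ini
# indexes.conf (sketch only) -- non-overlapping volume paths
[volume:Hot]
path = /splogs_hot

[volume:Cold]
path = /splogs_cold

# volume:Base either removed, or pointed at a third non-overlapping location
```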
In both cases I highly recommend testing that in a non-prod environment first, and possibly talking to support as well.
thanks martin, actually we don't have a test machine in our environment, so I am planning to raise a ticket with the Splunk support team to seek their help on this.
index=_internal is set to retain at least 30 days by default, so seeing only data from this and the previous month is no surprise. If you're only seeing moves from warm to cold, then it seems Splunk isn't even trying to freeze (= delete, by default) data from cold.
I'd recommend you fix the overlapping volumes first. Considering the documentation explicitly forbids this configuration I wouldn't be surprised if this weird behaviour around not freezing buckets might go away then.
I also wouldn't be surprised if it didn't go away, but still - if you find an obviously wrong configuration, fixing it is rarely a bad idea.
thanks martin for helping me. But I am not sure how to change the configuration, as it was done by a Splunk expert who is no longer with the organization. Could you guide me through the steps to change it?
One thing in your config that may or may not contribute to issues: You have three volumes all pointing at the same path. That's explicitly forbidden in the docs:
path = <path on server>
* Required.
* Points to the location on the file system where all databases that use this volume will
reside. You must make sure that this location does not overlap with that of any other
volume or index database.
http://docs.splunk.com/Documentation/Splunk/6.2.1/admin/indexesconf
H/T to @dshpritz 🙂
thanks martin, I tried to execute the query below with the time frame set to All time, but I could see data present only for two months (from July 2016 to the current date). Even when I executed index=_internal* I could only see data from July 2016 until now.
index=_internal component=bucketmover idx=net_fw splunk_server=splunk03
08-10-2016 01:37:27.302 -0400 INFO BucketMover - idx=net_fw Moving bucket='db_1467094714_1467081929_6749' because maximum number of warm databases exceeded, starting warm_to_cold: from='/splogs/net_fw/db' to='/splogs/net_fw/colddb'
08-10-2016 01:37:27.278 -0400 INFO BucketMover - idx=net_fw Moving bucket='db_1467081941_1467072478_6748' because maximum number of warm databases exceeded, starting warm_to_cold: from='/splogs/net_fw/db' to='/splogs/net_fw/colddb'
08-10-2016 01:37:27.255 -0400 INFO BucketMover - idx=net_fw Moving bucket='db_1467072469_1467058580_6747' because maximum number of warm databases exceeded, starting warm_to_cold: from='/splogs/net_fw/db' to='/splogs/net_fw/colddb'
08-09-2016 22:39:23.905 -0400 INFO BucketMover - idx=net_fw Moving bucket='db_1467058579_1467047364_6746' because maximum number of warm databases exceeded, starting warm_to_cold: from='/splogs/net_fw/db' to='/splogs/net_fw/colddb'
08-09-2016 19:28:41.901 -0400 INFO BucketMover - idx=net_fw Moving bucket='db_1467047363_1467031159_6745' because maximum number of warm databases exceeded, starting warm_to_cold: from='/splogs/net_fw/db' to='/splogs/net_fw/colddb'
We can only see the last two months of data when we run the above query. Kindly let me know whether it is expected to show results like this.
thanks in advance.
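As a side note, the epochs embedded in those bucket names tell you exactly what time range each bucket covers, which helps when deciding what is safe to age out. A minimal shell sketch using one bucket name from the log output above (bucket directories are named db_&lt;newestEventEpoch&gt;_&lt;oldestEventEpoch&gt;_&lt;localId&gt;):

```shell
# Decode the event time range of a Splunk bucket from its directory name.
bucket="db_1467094714_1467081929_6749"   # taken from the BucketMover log above
newest=$(echo "$bucket" | cut -d_ -f2)   # epoch of the newest event
oldest=$(echo "$bucket" | cut -d_ -f3)   # epoch of the oldest event
echo "newest=$newest oldest=$oldest"
# prints: newest=1467094714 oldest=1467081929
# On GNU date you could render these as dates: date -u -d "@$newest"
```

Here both epochs fall in late June 2016, i.e. this bucket holds events only a few weeks old at the time it rolled to cold.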
Martin, can you guide me on whether I can still go ahead and delete the older data present under /splogs/net_fw/colddb/?
Why not just manually delete it from the file system, then set your limits on how much to retain in colddb?
Yes, but I have yet to get permission from the organization to delete the data. Where can I find the setting for those limits? There are 46 indexes.conf files configured across all 5 individual servers. So do you want me to make the change on all 5 servers under the path /opt/splunk/etc/system/local/limits.conf?
thanks in advance.
Okay, so Splunk knows about 1.3 TB in that index on that indexer, but also knows it should keep it below 250 GB? That feels wrong.
Are you sure there are no errors, warnings, etc. around component=BucketMover or similar things in index=_internal?
It seems dbinspect is picky about spaces; make sure you remove the spaces around the equals sign: | dbinspect index=net_fw
yes, you are right, I got this output:
|dbinspect index=net_fw | search state=cold splunk_server=splunk03 | stats count sum(sizeOnDiskMB)
count: 2974    sum(sizeOnDiskMB): 1344829.046891
|dbinspect index=* | search state=cold splunk_server=splunk03 | stats count sum(sizeOnDiskMB)
count: 8644    sum(sizeOnDiskMB): 3437339.870991
Regarding the query: I forgot to change the index; you should of course use | dbinspect index=net_fw to match yours. Regarding the time range, use All Time, not All Time (Real-time), though two years should have the same effect.
If you still see nothing, remove the | search part and check that your Splunk server's name is correct.
If you still see nothing, have one of your Splunk admins run the query; you might be lacking permissions.
Regarding cleaning up: it seems you have an old app from 2013 that used to define the indexes, and a new app starting with ADMIN also defining the indexes. Splunk is good at merging these, but having multiple locations just increases the room for human error.
Martin, after executing the query with the time period set to 2 years, I am still getting no results. I even tried removing the search command, but still no luck. Regarding permissions, I believe I have admin privileges.
|dbinspect index = net_fw | search state=cold splunk_server=splunk03 | stats count sum(sizeOnDiskMB)
Regarding the old app, "/opt/splunk/etc/apps/all_indexer_base/local/indexes.conf.2013.06.03": should I uncomment the entire stanza?
thanks in advance.
The configuration as output by btool looks good, no replication going on and the 250GB ceiling was recognized. You should eventually clean up the four different locations all defining indexes.conf, but that's not the issue here - btool merges things correctly.
Regarding DMC - I think the Indexes views were added in 6.3 or 6.4.
As an alternative, you can run | dbinspect index=_internal | search state=cold splunk_server=Martin-PC | stats count sum(sizeOnDiskMB)
over all time, might take a moment.
Compare the results with what you see on disk. I'm trying to check whether Splunk still knows about all of the buckets, i.e. whether starting the freeze never happened, or whether the freeze itself failed. If you spot buckets on disk that aren't known to Splunk, you should be able to rm those fairly safely; Splunk will probably never clean them up on its own.
In both cases there should be events in _internal complaining about errors. Are all BucketMover events just moves from warm to cold? Make sure not to check just the last 60 minutes; freezing may not happen every day.
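One way to look for those complaints, sketched as a search (this assumes the usual splunkd.log field extractions; log_level and component come with sourcetype=splunkd, so adjust if your setup differs):

```
index=_internal sourcetype=splunkd component=BucketMover NOT log_level=INFO
```

Run it over a long time range. An empty result here, combined with cold buckets far past their freeze age, would point at the freeze never being attempted rather than failing.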