Getting Data In

We have a shortage of disk space in one indexer. Can we delete data present in the colddb directory?

Hemnaath
Motivator

We are currently running out of space on one Splunk indexer out of the 5 indexers in our distributed environment. We are using Splunk version 6.2.1.
The total size of the indexer volume is about 5.2 TB. We are currently left with less than 100 GB of free space, and on average about 10 GB is consumed every day. The data occupying the space is almost 3.5 years old, and most of it sits under the colddb directories on the mount point /splogs.

Disk Usage status

df -h /splogs
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_splunk03_san-splunk_logs
                      5.6T  5.3T   93G  99% /splogs

We found that most of the space is occupied by these indexes:

[net_proxy], [net_fw], [unix_svrs] & [unix_bsm] 

Example:

[root@splunk03 splogs]# cd unix_svrs
[root@splunk03 unix_svrs]# ls -ltr
total 416
drwx------    2 splunk splunk   4096 Apr 19  2012 thaweddb
drwx------ 1590 splunk splunk 102400 Aug  6 09:18 colddb
drwx------ 1890 splunk splunk 131072 Aug  6 12:51 summary
drwx------ 1893 splunk splunk 143360 Aug  6 12:53 datamodel_summary
drwx------  307 splunk splunk  28672 Aug  6 12:54 db
[root@splunk03 unix_svrs]# du -sh *
1007G   colddb
1.6G    datamodel_summary
229G    db
366M    summary
4.0K    thaweddb

[root@splunk03 splogs]# cd net_fw
[root@splunk03 net_fw]# ls -ltr
total 612
drwx------    2 splunk splunk   4096 Apr 19  2012 thaweddb
drwx------ 1358 splunk splunk 131072 Sep 27  2015 summary
drwx------ 2956 splunk splunk 180224 Aug  6 12:17 colddb
drwx------ 3258 splunk splunk 266240 Aug  6 12:55 datamodel_summary
drwx------  313 splunk splunk  28672 Aug  6 12:55 db
[root@splunk03 net_fw]# du -sh *
1.3T    colddb
76G     datamodel_summary
147G    db
24M     summary
4.0K    thaweddb

indexes.conf details for these indexes:

[volume:Hot]
path = /splogs

[volume:Cold]
path = /splogs

[volume:Base]
path = /splogs

[default]
frozenTimePeriodInSecs = 31536000

[net_fw]
homePath = volume:Hot/net_fw/db
coldPath = volume:Cold/net_fw/colddb
tstatsHomePath = volume:Hot/net_fw/datamodel_summary
thawedPath = $SPLUNK_DB/net_fw/thaweddb
maxTotalDataSizeMB = 250000

[unix_svrs]
homePath = volume:Hot/unix_svrs/db
coldPath = volume:Cold/unix_svrs/colddb
tstatsHomePath = volume:Hot/unix_svrs/datamodel_summary
thawedPath = $SPLUNK_DB/unix_svrs/thaweddb
maxTotalDataSizeMB = 250000

[summary]
frozenTimePeriodInSecs = 188697600

There are other indexers configured in the same manner as shown above in Indexes.conf.

Kindly let me know whether we can delete the data present under the colddb directories for the indexes occupying more than 1 TB. What would be the impact of doing this? Or is there any other way to prevent the Splunk service from failing due to low disk space?


skoelpin
SplunkTrust

What do you have your bucket size set to in Splunk?

To answer your question, you could change the retention policy so your cold data rolls to frozen, which deletes it by default, or you could manually delete data from the cold buckets with no impact.

It looks like you have frozenTimePeriodInSecs = 31536000, which means a bucket will be frozen/deleted once its newest event is older than 1 year. You may want to consider reducing this number if you have a high volume of events coming in, or you could grow the disk size.

http://docs.splunk.com/Documentation/Splunk/6.4.2/Indexer/Setaretirementandarchivingpolicy
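
For illustration only, a per-index override in indexes.conf might look something like the sketch below (the six-month value is purely an example, not a recommendation for your data):

[net_fw]
# Freeze buckets whose newest event is older than roughly six months.
# With no coldToFrozenDir or coldToFrozenScript set, frozen buckets are deleted.
# 15552000 seconds = ~180 days -- example value only
frozenTimePeriodInSecs = 15552000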


Hemnaath
Motivator

thanks skoelpin,

There are two frozenTimePeriodInSecs values set: one under the [default] stanza (31536000) and another under the [summary] stanza (188697600), as shown in indexes.conf. Which one will be taken into consideration when deleting the data?


skoelpin
SplunkTrust

You would want to change this under your [default] stanza.

You can delete data in your colddb by doing an rm -rf, but I'm not sure if this is the proper way of doing it. I've personally done it 3 times when we were in crisis mode and there were zero negative side effects doing it this way.

So to answer your question, yes, you technically could rm -rf data in the colddb to clear up some room.
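
If you do go that route, one possible sketch (not an official procedure) is to target only the oldest cold buckets instead of a blanket rm -rf, using the epochs embedded in the standard db_<newestTime>_<oldestTime>_<id> bucket directory names, and ideally with Splunk stopped on that indexer:

cd /splogs/net_fw/colddb
# list cold buckets sorted by the newest-event epoch in the directory name (2nd "_" field),
# oldest first, and review the candidates before touching anything
ls -d db_* | sort -t_ -k2,2n | head -20
# translate an epoch taken from a bucket name into a human-readable date to double-check its age
date -d @1338000000        # placeholder epoch -- replace with one from an actual bucket name
# rm -rf <bucket_directory>   # only after confirming the bucket is truly no longer needed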


Hemnaath
Motivator

Thanks skoelpin for sharing your experience. I might do the same after getting the required permission, as it's a prod environment.


martin_mueller
SplunkTrust

Whether you can remove old buckets or not depends on whether you need the data in those buckets or not - we can't help you there.

That being said, taking a look at your config I have a few pointers.

Are your indexers sharing that 5.2TB? If so, are all five indexers writing into the same path? That's looking for trouble.

Doing the maths suggests this is the case: each indexer is configured to consume up to 250GB for each of those indexes. Multiplied by five, that's 1.25TB per index - and both currently are at about 1.25TB.

You should see old buckets being removed all the time - search index=_internal component=bucketmover idx=unix_svrs (or the other index). If you're at the maximum configured space, Splunk will throw out the oldest buckets on its own and the size should not grow further.

If you need more space for other indexes AND have figured out that you can throw out more old data, you could reduce maxTotalDataSizeMB on the indexers a bit. Then they'll throw out more old buckets. Just deleting buckets while Splunk is using them is again looking for trouble.
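
Purely as a sketch of that last point (the 200000 value below is an arbitrary example, not a sizing recommendation):

[net_fw]
# Lower the per-indexer cap for this index; once the index exceeds this size,
# Splunk freezes (by default, deletes) its oldest buckets until it fits again.
maxTotalDataSizeMB = 200000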

Another point, you've configured a year of data retention. Do check if your disk is large enough to make it to one year, assuming that year is based on compliance "must store a year" rather than privacy "cannot store more than a year".

martin_mueller
SplunkTrust

"Expert"... hummmmm.

I see two approaches, removing two volume definitions and mapping all indexes that used those two volumes to the remaining volume at the same path, or moving the volumes' paths to different locations.
In both cases I highly recommend testing that in a non-prod environment first, and possibly talking to support as well.
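
As an illustration of the first approach only (volume and index names kept from the config quoted above, everything else unchanged), the sketch below collapses the three overlapping volumes into one and points the index stanzas at it:

# indexes.conf -- sketch, assuming the existing /splogs mount stays in place
[volume:Base]
path = /splogs

[net_fw]
homePath = volume:Base/net_fw/db
coldPath = volume:Base/net_fw/colddb
tstatsHomePath = volume:Base/net_fw/datamodel_summary
thawedPath = $SPLUNK_DB/net_fw/thaweddb
maxTotalDataSizeMB = 250000

The [volume:Hot] and [volume:Cold] stanzas would then be removed, and unix_svrs (and any other index using them) remapped the same way. Because the resolved paths stay identical, no buckets would need to be moved on disk.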


Hemnaath
Motivator

Thanks martin. Actually, we don't have a test machine in our environment, so I am planning to raise a ticket with the Splunk support team to seek their help on this.


martin_mueller
SplunkTrust

index=_internal is set to retain at least 30 days by default, so seeing only data from this and the previous month is no surprise. If you're only seeing moves from warm to cold then it seems Splunk isn't even trying to freeze (=delete by default) data from cold.

I'd recommend you fix the overlapping volumes first. Considering the documentation explicitly forbids this configuration I wouldn't be surprised if this weird behaviour around not freezing buckets might go away then.
I also wouldn't be surprised if it didn't go away, but still - if you find an obviously wrong configuration, fixing it is rarely a bad idea.

Hemnaath
Motivator

Thanks martin for helping me. But I am not sure how to change the configuration, as it was done by a Splunk expert who is no longer with the organization. Is it possible for you to guide me through the steps to change the configuration?


martin_mueller
SplunkTrust

One thing in your config that may or may not contribute to issues: You have three volumes all pointing at the same path. That's explicitly forbidden in the docs:

path = <path on server>
  * Required. 
  * Points to the location on the file system where all databases that use this volume will 
    reside.  You must make sure that this location does not overlap with that of any other 
    volume or index database.

http://docs.splunk.com/Documentation/Splunk/6.2.1/admin/indexesconf
H/T to @dshpritz 🙂

Hemnaath
Motivator

Thanks martin, I tried to execute the below query with the time frame set to All Time, but I could see data present for only two months (from July 2016 to the current date). Even when I executed index=_internal* I could see data available only from July 2016 until now.

index=_internal component=bucketmover idx=net_fw splunk_server=splunk03.

08-10-2016 01:37:27.302 -0400 INFO BucketMover - idx=net_fw Moving bucket='db_1467094714_1467081929_6749' because maximum number of warm databases exceeded, starting warm_to_cold: from='/splogs/net_fw/db' to='/splogs/net_fw/colddb'
08-10-2016 01:37:27.278 -0400 INFO BucketMover - idx=net_fw Moving bucket='db_1467081941_1467072478_6748' because maximum number of warm databases exceeded, starting warm_to_cold: from='/splogs/net_fw/db' to='/splogs/net_fw/colddb'
08-10-2016 01:37:27.255 -0400 INFO BucketMover - idx=net_fw Moving bucket='db_1467072469_1467058580_6747' because maximum number of warm databases exceeded, starting warm_to_cold: from='/splogs/net_fw/db' to='/splogs/net_fw/colddb'
08-09-2016 22:39:23.905 -0400 INFO BucketMover - idx=net_fw Moving bucket='db_1467058579_1467047364_6746' because maximum number of warm databases exceeded, starting warm_to_cold: from='/splogs/net_fw/db' to='/splogs/net_fw/colddb'
08-09-2016 19:28:41.901 -0400 INFO BucketMover - idx=net_fw Moving bucket='db_1467047363_1467031159_6745' because maximum number of warm databases exceeded, starting warm_to_cold: from='/splogs/net_fw/db' to='/splogs/net_fw/colddb'

We can see only the last two months of data when we run the above query. Kindly let me know whether it is expected to show like this when running the above query.
Thanks in advance.


Hemnaath
Motivator

Martin, can you guide me on whether I can still go ahead and delete the older data present under /splogs/net_fw/colddb/?


skoelpin
SplunkTrust

Why not just manually delete it from the file system, then set your limits on how much to retain in your colddb?


Hemnaath
Motivator

Yes, but I have yet to get permission from the organization to delete the data. Where can I find the limits.conf file? There are 46 indexes.conf files configured across the 5 individual servers, so do you want me to make the change on all 5 servers under the path /opt/splunk/etc/system/local/limits.conf?

thanks in advance.


martin_mueller
SplunkTrust

Okay, so Splunk knows about 1.3TB in that index on that indexer but also knows it should keep it below 250GB? That feels wrong.
Are you sure there are no errors, warnings, etc. around component=BucketMover or similar things in index=_internal?
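
One hedged way to check (splunk03 is just the server name used earlier in this thread; adjust index, server, and keywords to taste):

index=_internal splunk_server=splunk03 component=BucketMover log_level!=INFO
index=_internal splunk_server=splunk03 sourcetype=splunkd log_level=ERROR (bucket OR freeze OR frozen)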


martin_mueller
SplunkTrust

It seems dbinspect is picky about spaces - make sure you remove the spaces around the equals sign: | dbinspect index=net_fw


Hemnaath
Motivator

Yes, you are right. I got this output:

|dbinspect index=net_fw | search state=cold splunk_server=splunk03 | stats count sum(sizeOnDiskMB)

count = 2974    sum(sizeOnDiskMB) = 1344829.046891

|dbinspect index=* | search state=cold splunk_server=splunk03 | stats count sum(sizeOnDiskMB)

count = 8644    sum(sizeOnDiskMB) = 3437339.870991


martin_mueller
SplunkTrust

Regarding the query, I forgot to change the index - you should of course use | dbinspect index=net_fw to match yours. Regarding the time range, use All Time, not All Time (Real-time)... though two years should have the same effect.
If you still see nothing, remove the | search and check if your splunk server's name is correct.
If you still see nothing, have one of your Splunk admins run the query - you might be lacking permissions then.

Regarding cleaning up, it seems you have an old app from 2013 that used to define the indexes, and a new app starting with ADMIN also defining the indexes. Splunk is good at merging these, but having multiple locations just increases the room for human error.


Hemnaath
Motivator

Martin, after executing the query with the time period set to 2 years, I am still getting "No results found". I even tried removing the search command, but still no luck. Regarding permissions, I believe I have admin privileges.

|dbinspect index = net_fw | search state=cold splunk_server=splunk03 | stats count sum(sizeOnDiskMB).

Regarding the old app file "/opt/splunk/etc/apps/all_indexer_base/local/indexes.conf.2013.06.03", do I need to uncomment the entire stanza?
thanks in advance.


martin_mueller
SplunkTrust

The configuration as output by btool looks good, no replication going on and the 250GB ceiling was recognized. You should eventually clean up the four different locations all defining indexes.conf, but that's not the issue here - btool merges things correctly.

Regarding DMC - I think the Indexes views were added in 6.3 or 6.4.
As an alternative, you can run | dbinspect index=_internal | search state=cold splunk_server=Martin-PC | stats count sum(sizeOnDiskMB) over all time, might take a moment.
Compare the results with what you see on disk - I'm trying to check if Splunk is still using any of the buckets... ie if starting the freeze didn't happen, or if the freeze itself failed. If you spot buckets on disk that aren't known to Splunk you should be able to rm those fairly safely, and Splunk will probably never clean them on its own.
In both cases, there should be events in _internal complaining about errors; are all BucketMover events just moves from warm to cold? Make sure to not just check 60 minutes, freezing may not happen every day.
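
For the on-disk comparison, a hedged sketch of listing the cold bucket paths Splunk knows about (run over All Time; path, state, and sizeOnDiskMB are standard dbinspect output fields):

| dbinspect index=net_fw | search state=cold splunk_server=splunk03 | table path sizeOnDiskMB | sort path

Anything sitting under /splogs/net_fw/colddb on disk that does not show up in that path column would be a candidate for the kind of orphaned bucket described above.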
