Monitoring Splunk

Troubles with disk space / quotas / $SPLUNK_HOME/var/run/splunk/srtemp/

bwakely
Explorer

Greetings all,

Issue: Space on server exhausted, primarily in folder $SPLUNK_HOME/var/run/splunk/srtemp

Splunk version: v4.2.5

OS / version: Red Hat Enterprise Linux 6.2

Steps to replicate:

  • Use an app with dashboard views containing more than a few charts (in this case, the 'Splunk for F5 (beta)' app, v0.2).

  • Extend the time range beyond a small window (more than a few days).

  • Splunk exhausts the available disk space in a very short time.

Further Diagnosis:

  • I don't believe this is a problem specific to the app.

  • The app collects data from the indexer, slowly accumulating results in the $SPLUNK_HOME/var/run/splunk/dispatch folder. This process obeys the per-user dispatch quota.

  • This continues for some time; then, when all the data is collected (or the quota is hit), the dashboards / graphs begin to render.

  • When the 10 dashboard widgets start to populate, Splunk starts filling up the 'srtemp' working directory with intermediate calculations.

  • These are populated in parallel and grow to be very large (each one takes about 1 GB per day's worth of data being crunched in our case),
    so, for example, 10 days of history takes 10 days × 1 GB = 10 GB in under 5 minutes. A simple way to watch this growth is sketched below.
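For reference, the growth is easy to watch while a dashboard loads; here's a minimal sketch (standard paths under $SPLUNK_HOME, and the 5-second interval is arbitrary):

# Watch the dispatch and srtemp working directories grow while the dashboard renders
watch -n 5 "du -sh $SPLUNK_HOME/var/run/splunk/dispatch $SPLUNK_HOME/var/run/splunk/srtemp"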

I believe that this result could be replicated by any dashboard that's intensive enough, so I don't think it's specific to the app - I think that it's a problem with Splunk.

Also, this problem won't really be solved by leaving some amount of overhead, as it will be trivial for a normal user to run the server out of space by doing the following:

  • Generate a dashboard with an arbitrary number of charts working off the same dataset.

  • Load the dashboard.

  • Their dispatch directory will fill up to the quota (e.g. 100 MB), which helps limit that part of the usage, but the 'srtemp' space will fill up depending on how many charts there are and how
    complex they are. (The dispatch quota in question can be checked as shown below.)
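For context, the dispatch quota I'm referring to is the per-role srchDiskQuota setting in authorize.conf - it caps dispatch usage, not srtemp. A sketch of how to check it, assuming the default 'user' role:

# Show the per-role search disk quota (this governs dispatch usage, not srtemp)
$SPLUNK_HOME/bin/splunk btool authorize list role_user --debug | grep -i srchDiskQuota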

I've submitted a support case regarding this (86131); the response has been:

Currently we don't have any parameter to limit the size of srtemp.

The reason is that we don't know how big the result might be; limiting the size of the temp folder would cause incomplete search results.

I suggest leaving at least 2 GB for temporary usage.


While I appreciate the response,
for our use case it's trivial for a user to fill up the available space, including any amount of reserve (2 GB or upwards), just by clicking on the wrong time range - which ends up presenting incomplete search results anyway, as well as crashing the server...

I've submitted an enhancement request as part of the same case to implement some kind of per-user quota applied to working/temporary space, but I was wondering if anyone else had come across this problem and, if so, how they were dealing with it.

Any similar experiences?

jrodmantcell
Explorer

srtemp is the workspace used for search 'post-process' actions, which are used in Splunk dashboards to provide additional search pipeline processing of searches.

This is useful when you want one base search to gather information and then present portions of that information in multiple ways; it permits efficient creation of dashboards with complex displays.

Unfortunately, post-process actions aren't full searches: they run inside the main splunkd process rather than as safe, separate search processes, so they don't get all the protections built for normal searches. One of the protections they lack is the cleanup logic that handles stale data in var/run/splunk/dispatch, so if the main splunkd crashes, power is lost, or similar, the in-flight data lives forever.

If this location is growing rapidly while Splunk is continuing to work and not crashing, it more likely represents a case where you are post-processing a very large base search in a very popular dashboard (or many dashboards), in which case you may have to hunt down the expensive dashboard and redesign the base search to emit a smaller result set.
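One rough way to spot the expensive jobs is to see which working directories are largest while the dashboards are in use; a minimal sketch (standard paths, and the tail -20 cutoff is arbitrary):

# Show the largest working directories under srtemp and dispatch, biggest last
du -sk "$SPLUNK_HOME"/var/run/splunk/srtemp/* "$SPLUNK_HOME"/var/run/splunk/dispatch/* 2>/dev/null | sort -n | tail -20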

Of course, Splunk should be changed to:

  • Clean up srtemp contents over time.
  • Apply storage quotas to post-process actions in addition to normal searches.

Less obviously, Splunk needs to be changed so that post-process actions are fully converted into normal searches. This is somewhat delayed by the need to make full searches start up as quickly as post-process actions, which is why they are special in the first place.

Apparently srtemp is the temp dir for post-process actions, which are used in dashboards when rendering the search results with some additional search logic (usually filtering or charting a small number of items, fewer than 10,000).

Since these post-processes run inside the main splunkd, and since the search machinery deletes the temp files on completion of the search (or really on each chunk of the search, but hopefully you have only one chunk for a post-process), the implication is that the main splunkd process crashed at some point, or possibly at multiple points.

If the main splunkd crashes, there's currently no machinery to prune old data out of srtemp. So you could wipe old directories in there while Splunk is live, or you could stop Splunk and wipe the entire contents.
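A minimal sketch of the stop-and-wipe approach (assuming you're happy to discard all in-flight post-process workspace):

# With Splunk stopped, the whole srtemp working area can be cleared safely
$SPLUNK_HOME/bin/splunk stop
rm -rf "$SPLUNK_HOME"/var/run/splunk/srtemp/*
$SPLUNK_HOME/bin/splunk start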

There should definitely be automatic cleanup code added to handle cases like splunkd crashing, an operating system crash, or power loss. I would argue, of course, that post-process actions should be transformed into proper full searches, with all the safety of normal searches applying.

andygerberkp
Explorer

My observation is that the timestamps of all the directories in srtemp relate to when splunkd was shut down or crashed (or was killed by the OS out-of-memory killer). So in general a find command such as:

# Remove srtemp working directories untouched for more than 30 days
find "$SPLUNK_HOME/var/run/splunk/srtemp" -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +

would be a fine thing to run weekly on any heavily used Splunk search head.
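To automate that, a minimal sketch of a weekly cron entry (the file name is hypothetical, $SPLUNK_HOME is written out as /opt/splunk because cron doesn't normally have it set, and the schedule/retention are only examples):

# /etc/cron.d/splunk-srtemp-cleanup  (hypothetical file)
# Every Sunday at 03:00, delete srtemp working directories older than 30 days
0 3 * * 0  root  find /opt/splunk/var/run/splunk/srtemp -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +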


tzhmaba2
Path Finder

Hi,

I have the same issue, but with var/run/dispatchtmp and other var/run directories. Actually, the answer from Splunk support is correct: no one (including the owner/administrator of a Splunk instance) has any idea how big the results can be, or whether a huge result is expected or was created by mistake. 😞

So... just add more disk. That's how I have "solved" it.

bwakely
Explorer

I think we'll just have to go with mounting /var on a completely separate partition, so it doesn't exhaust the server's root partition.

It means we have to provide an extra $RESERVE (2 GB in our case) of unusable space on that partition - if it gets used up, Splunk will stop indexing - but it's the lesser of two evils.
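For what it's worth, a rough sketch of how the reserve can be watched (a hand-rolled check, not a Splunk feature; /var as the separate mount and the 2 GB threshold are our own choices):

#!/bin/sh
# Warn when the /var partition drops below our chosen reserve (2 GB here)
RESERVE_KB=$((2 * 1024 * 1024))
AVAIL_KB=$(df -Pk /var | awk 'NR==2 {print $4}')
if [ "$AVAIL_KB" -lt "$RESERVE_KB" ]; then
    echo "WARNING: /var has only $((AVAIL_KB / 1024)) MB free" | logger -t splunk-disk
fi
# Splunk's own stop-indexing threshold is minFreeSpace in server.conf ([diskUsage] stanza);
# it can be inspected with: $SPLUNK_HOME/bin/splunk btool server list diskUsage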

Thanks for speaking up.

--Benji
