I have a search that works most of the time, but sometimes it causes Splunk to crash and requires a restart. I have a ticket open with Splunk Support, but they still haven't been able to figure out what's going on, so I thought I would post here. This is the search I use in my dashboard.
<![CDATA[| metadata type=hosts
| eval age = now() - lastTime
| where age > 300 and age < 86400
| convert ctime(lastTime)
| eval field_in_ddhhmmss = tostring(age, "duration")
| rename field_in_ddhhmmss as "Time Offline" lastTime as "Last Update Time"
| join host
    [search sourcetype=systemInfo
    | rename serial as "Serial Number" isp as "ISP" state as "State" city as "City"]
| sort + "Time Offline"
| table "Serial Number", "Time Offline", "Last Update Time", "ISP", "City", "State"]]>
I use it to find computers that were checking in at least once in the last 24 hours, but have not checked in for the last 5 minutes. I then use "join" to match a sourcetype to the host so I can pull some specific data about those hosts. The search is fairly fast and runs in a couple of seconds. I had been using the same search for months, but this started happening a couple of weeks ago. The only thing that's changed is that we have added more hosts. CPU/memory usage on the Splunk server is low when it crashes, and we're not seeing any spikes when this happens.
Have you looked at your last Splunk crash log? Are there any errors in the log about too many open files? If this is on a Linux server, this could be a ulimit setting issue.
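If it's easier than digging through the crash log on disk, you could also search the internal index for that error directly. This is just a sketch; the exact message text can vary between Splunk versions:

index=_internal source=*splunkd.log* log_level=ERROR "Too many open files"

If that returns hits around the times of the crashes, ulimit is a likely culprit.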
Splunk has had me run a bunch of diag logs and they don't see it crashing. Recently I started Splunk in debug mode and captured another log for them. I'm still waiting to hear back about what they found. I'm just puzzled that it works almost all the time; when I try to crash it to get a log, it normally takes me about 20-30 minutes of running it almost continually before it crashes.
Other users and I do run this dashboard as non-admins. I'm not a Splunk admin, but I can have one check. How would one check for disk quota issues? The Splunk admins and I are fairly new to Splunk.
Well, you could run something like this to look for quota issues:
index=_internal sourcetype=splunkd component=DispatchManager quota
(I'm not 100% sure whether this covers ad-hoc searches.)
You can use this search to see how much search disk space each user is currently consuming on the search head:
| rest splunk_server=local /services/search/jobs | eval diskUsageMB=diskUsage/1024/1024 | stats sum(diskUsageMB) by eai:acl.owner
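To compare that usage against the configured limits, something like this should list the quota settings for each role (srchDiskQuota and srchJobsQuota are the per-role limits from authorize.conf; I'm assuming your Splunk version exposes them through the roles REST endpoint):

| rest splunk_server=local /services/authorization/roles | table title srchDiskQuota srchJobsQuota

If a user's summed diskUsageMB from the previous search is near their role's srchDiskQuota, their searches will start getting blocked or queued, which would show up in the DispatchManager messages above.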