Good morning (or afternoon) fellow Splunkers,
We've got an issue that has us quite perplexed. I'll post all information that I find relevant, but feel free to request more. The only similar problem I've found is "Why is a Splunk embedded dashboard failing with search head clustering?". While the problem sounds identical to ours, we do not have search head clustering, and our version is 7.2.4.
We have a dashboard. The dashboard contains Simple XML, JavaScript, and CSS. However, the issue only involves one base search in the Simple XML that references a saved search, so the JS and CSS should be irrelevant. The saved search runs every minute, and its job artifacts expire after two minutes. Here is the base search in the Simple XML:
<search id="baseSearchName">
<query>| loadjob savedsearch=[user]:[app]:[savedsearch]
| rename xyz as abc
| dedup abc
</query>
<earliest>-90s</earliest>
<latest>now</latest>
<refresh>2m</refresh>
<refreshType>delay</refreshType>
<sampleRatio>1</sampleRatio>
</search>
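For context, the schedule and expiry of that saved search come down to a couple of savedsearches.conf settings along these lines (the stanza name is the same placeholder as in the loadjob above, and dispatch.ttl = 120 is just my reading of "expires after two minutes"):
# savedsearches.conf - runs every minute, artifacts expire after two minutes
[savedsearch]
enableSched = 1
cron_schedule = * * * * *
dispatch.ttl = 120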
Note that we only go back 90 seconds (the saved search runs every minute), and that the dashboard panel should refresh every 2 minutes. This base search feeds directly into 9 single value panels (using status_indicator_app), as seen below:
<panel>
<title>$tok01$ Title</title>
<viz id="viztok01" type="status_indicator_app.status_indicator">
<search base="baseSearchName">
<query>search abc="$tok01$"
| fields - abc
| fieldformat count=round(count) . "%"</query>
</search>
<option name="drilldown">all</option>
<option name="height">250</option>
<option name="refresh.display">none</option>
<option name="status_indicator_app.status_indicator.colorBy">field_value</option>
<option name="status_indicator_app.status_indicator.fillTarget">background</option>
<option name="status_indicator_app.status_indicator.fixIcon">check</option>
<option name="status_indicator_app.status_indicator.icon">field_value</option>
<option name="status_indicator_app.status_indicator.precision">0</option>
<option name="status_indicator_app.status_indicator.showOption">1</option>
<option name="status_indicator_app.status_indicator.staticColor">#65a637</option>
<option name="status_indicator_app.status_indicator.useColors">true</option>
<option name="status_indicator_app.status_indicator.useThousandSeparator">true</option>
<drilldown>
<set token="abc">$tok01$</set>
</drilldown>
</viz>
</panel>
That same code is repeated 9 times (on 3 rows) with tok01 through tok09. Here's where it gets interesting. Everything above works fine: anyone can pull up the dashboard, and all 9 status indicator visualizations display their proper values. However, when the dashboard sits overnight, more often than not one or more of the panels (a random set that changes every time) hit the error shown below. They remain broken, with no single value or visualization, until either the panel or the dashboard is manually refreshed. The error appears in the developer console roughly once every 30 minutes. It does not show on page load - it only happens over time.
Failed to load resource: the server responded with a status of 500 (Internal Server Error)
This is accompanied by a URL:
https://192.168.0.1/en-US/splunkd/__raw/services/search/jobs/[JobSID]/results_preview?output_mode=json&search=[URL encoding of the search in viztok01]=1569356939184
Which simply displays this:
{"messages":[{"type":"FATAL","text":"Unknown sid."}]}
Which would make me think that it's referencing an expired SID. However, this is not the case: all panels show "<1m ago" as the latest refresh time, and clicking the inspector on each panel references the same (base) search. Clicking into search.log shows a normal log with no errors for both good and bad panels.
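For reference, a quick way to double-check whether that SID is still alive on the search head is to list the jobs through the REST API; a rough sketch, with [JobSID] standing in for the SID from the error URL:
| rest /services/search/jobs splunk_server=local
| search sid="[JobSID]"
| table sid dispatchState ttl isDone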
After all the research I've done this week, I'm going to try the "Auto refresh a dashboard" option (which I am pretty confident will work [edit: it did seem to work by refreshing the full dashboard once an hour]). However, I still cannot fathom how this happens in the first place. Either all panels should break or none, since they are identical and reference the same base search, which works fine.
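If I understand that option correctly, it boils down to the refresh attribute on the root element, something like this (3600 seconds for the hourly page reload):
<dashboard refresh="3600">
...
</dashboard>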
Does anyone have any ideas at all? Am I missing something here?
Edit: After trying Woodcock's solution, we're seeing the same issue. It seems that it could be happening due to some network issue where Splunk can't get the result of the post-process search. Then, once a panel is broken, Splunk no longer tries to refresh it, regardless of the settings.
@woodcock gave me an idea which did end up working. It has been running smoothly for months now. If you ever run into this issue, switch your base search to a reference (ref) to a scheduled saved search that runs every X minutes. While this is not optimal, this dashboard is used heavily in our environment, so we are fine with it running the saved search in the background every 5 minutes. The Simple XML looks like this:
<dashboard>
<search id="baseSearchName" ref="savedSearchName">
<refresh>60</refresh>
</search>
The panel part is identical to the one in my original question.
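The saved search itself is just scheduled every 5 minutes in savedsearches.conf, roughly like this (the stanza name is the same placeholder as the ref above):
[savedSearchName]
enableSched = 1
cron_schedule = */5 * * * *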
Thanks for the help everyone,
Jacob
Ack. I should have mentioned this, too. 2 things:
2. Change your |loadjob to |inputlookup MyBugWorkAround.csv, which bypasses the need for accessing the search results from the SID entirely (you will have to add | outputlookup MyBugWorkAround.csv to your existing saved search).
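Roughly, the change would look like this (a sketch only - the lookup name comes from the suggestion above, and the rename/dedup lines are carried over from the base search in the question). The scheduled saved search gains a final pipe:
| outputlookup MyBugWorkAround.csv
And the dashboard base search reads the lookup instead of the job artifacts:
<search id="baseSearchName">
<query>| inputlookup MyBugWorkAround.csv
| rename xyz as abc
| dedup abc
</query>
<refresh>2m</refresh>
</search>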
I'll look into that and get back to you. Thanks, Woodcock.
Number 2 appears to have worked. Marking this as answer unless we see the problem again.
I am concerned that the dashboard will load while the lookup is being written to, but we'll see if it's a valid worry or not. Hopefully the refresh will take care of it if that does ever happen.
Edit: It didn't work.
Just wondering why you used the loadjob command instead of the savedsearch command there.
Hi @techiesid,
Good question. The point of referring to a separate search instead of just running the search inline is that our client prefers immediate load times. Even a few seconds of loading bothers them (which to me is not an unreasonable request). The savedsearch command dispatches a new search and provides no performance improvement that I am aware of. If we run that saved search every minute, loadjob will just return the values from the latest completed run immediately (or close to it) instead of actually running the search. The combination of running the search every minute and always loading the most recent completed run allows our dashboard to load nearly instantly with near real-time data.
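To illustrate the difference (the owner, app, and search name are placeholders):
| savedsearch my_scheduled_search
dispatches a brand-new run of the saved search and makes the panel wait for it to finish, whereas
| loadjob savedsearch=admin:search:my_scheduled_search
just pulls the artifacts of the most recent completed scheduled run and returns almost immediately.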
https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Loadjob
https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Savedsearch
Cheers,
Jacob
Hi jacobevans -
We've had failures with loadjob commands simply because of demand on the SHC at the time the search jobs were supposed to complete. We do not have a solution for that so far.
1. You should check load on your searchheads in regards to jobs.
2. Check your jobs - are the saved searches that feed the loadjob completing in time?
3. It looks like there might be some kind of mismatch happening, since part of one job (the 60-second search) is included in the loadjob in the dash. Check the expiration of the jobs in the system (this should not be an issue since most saved searches expire in 24 hours, but it's good to check).
Grab the job ID (job SID) by running index=_audit action=* user= over a small time period. Then take just enough of the job ID to differentiate it from the others - the panel number plus a bit of the epoch time and GUID (EXAMPLE: search14_1571226849.1835686_8937EF78-DF5A-42E4-A493-B3593F051C27).
Widen the search period to something like 5 minutes and you should see the entire job lifetime and what happened to it.
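Putting those steps together, such an audit search could look roughly like this (the username is a placeholder; the SID fragment comes from the example above):
index=_audit action=search user=[user] search_id=*1571226849*
| table _time user action info search_id total_run_time event_count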
Hope this helps,
Mike
Here's the search and log I was able to find. It's a GET request, so it's in _internal instead of _audit. I couldn't find anything in _audit at all except for the base search, which works fine. The number 1572989217157 comes from the URL given in the console error.
Search:
index=_internal sourcetype=splunkd_ui_access "1572989217157"
Result:
[my_ip.0.1] - me@domain.com [05/Nov/2019:16:29:04.217 -0500] "GET /en-US/splunkd/__raw/services/search/jobs/[base search SID]/results_preview?output_mode=json&search=[URL Encoding of Post Process Search]=1572989217157 HTTP/1.1" 500 193 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36" - 9f91a30129a915df348028e84d7889f7 0ms
The only thing useful in there is the 500 Internal Server Error, which I also see in the console.
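For completeness, the matching server-side check would be to look for splunkd errors in the same window as that access-log entry; a generic sketch, with the time bounds simply bracketing the request above:
index=_internal source=*splunkd.log* log_level=ERROR
earliest="11/05/2019:16:28:00" latest="11/05/2019:16:30:00"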
Hi Mike,
Thanks for the advice - that does make sense. I will look into it and update my question if I find anything useful.
Cheers,
Jacob