Disk Quota Limits, Search API Endpoint Differences and Parameters
Looking for clarity and a deeper understanding so I can properly solve a recurring issue I'm seeing.
We have a script performing searches using the API. Currently, the flow works like this:
The script is designed not to run more than a few searches at a time, and to wait until earlier searches have been cancelled before starting the next one.
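In simplified form, each search goes through roughly the steps below (a sketch with a placeholder host and token, not our actual script):

```
# Rough sketch of the flow; host, token, and the search string are placeholders.
import time
import requests

BASE = "https://splunk.example.com:8089"       # hypothetical search head
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder credentials

def run_one_search(spl):
    # 1. Create the search job and remember its sid.
    r = requests.post(f"{BASE}/services/search/jobs",
                      headers=HEADERS, verify=False,
                      data={"search": f"search {spl}", "output_mode": "json"})
    r.raise_for_status()
    sid = r.json()["sid"]

    # 2. Poll until the job reports isDone.
    while True:
        job = requests.get(f"{BASE}/services/search/jobs/{sid}",
                           headers=HEADERS, verify=False,
                           params={"output_mode": "json"}).json()
        if job["entry"][0]["content"]["isDone"]:
            break
        time.sleep(2)

    # 3. Fetch the results (paged with offset/count for large result sets).
    results = requests.get(f"{BASE}/services/search/jobs/{sid}/results",
                           headers=HEADERS, verify=False,
                           params={"output_mode": "json", "count": 0}).json()

    # 4. Cancel the job so its dispatch artifact stops counting against
    #    the user's search disk quota.
    requests.post(f"{BASE}/services/search/jobs/{sid}/control",
                  headers=HEADERS, verify=False,
                  data={"action": "cancel", "output_mode": "json"})
    return results
```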
However, we are still sometimes hitting the search disk quota limit. We've increased the limit a few times, well past the default, which has reduced how often this happens, but the issue still comes up. We do NOT want to change it to unlimited, nor keep increasing it every time it gets hit. There are a few questions I'm not able to find documentation on while trying to figure out solutions:
- Would switching to using the /services/search/jobs/export endpoint help?
- With the 'auto_cancel' parameter, what counts as 'inactivity'?
- Would doing a DELETE /services/search/jobs/{sid} clear up space quicker?
The documentation is unclear on what some phrases mean (like 'inactivity', or 'rather than saved on the server' in the SDK docs), and some other parts are likely simplifications/abstractions of concepts I need to understand more in-depth (cancelling/deleting jobs, clearing disk space).
Trying to avoid just putting band-aids on a bullet wound, but need more details to determine the right treatment.
I am facing exactly the same issues. Our technique for getting the search results is also exactly the same. In the past we tried using export, the Splunk Python SDK, and a few other things. We had issues with all of them for searches that return a lot of data (millions of events, GBs of data). Have you found a solution?
Here are my findings:
"Would switching to using the /services/search/jobs/export endpoint help"?
From the documentation for the search/jobs/export endpoint:
"If it is too big, you might instead run with the search/jobs (not search/jobs/export) endpoint (it takes POST with the same parameters), maybe using the exec_mode=blocking."
So, using the export endpoint does not appear to be an option for searches returning a lot of data.
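For completeness, the difference between the two calls looks roughly like this (sketch only; host, token, and the search are placeholders):

```
# Sketch comparing /export streaming with a blocking job; placeholders throughout.
import json
import requests

BASE = "https://splunk.example.com:8089"
HEADERS = {"Authorization": "Bearer <token>"}
SPL = "search index=main earliest=-1h"

# /services/search/jobs/export: results stream back over a single long-lived
# HTTP response as the search runs; there is no job artifact to page through later.
with requests.post(f"{BASE}/services/search/jobs/export",
                   headers=HEADERS, verify=False, stream=True,
                   data={"search": SPL, "output_mode": "json"}) as resp:
    for line in resp.iter_lines():
        if line:
            chunk = json.loads(line)  # one JSON object per line (shape varies by version)

# /services/search/jobs with exec_mode=blocking: the POST only returns once the
# search has finished; results are then fetched from the job artifact on the server.
r = requests.post(f"{BASE}/services/search/jobs",
                  headers=HEADERS, verify=False,
                  data={"search": SPL, "exec_mode": "blocking",
                        "output_mode": "json"})
sid = r.json()["sid"]
results = requests.get(f"{BASE}/services/search/jobs/{sid}/results",
                       headers=HEADERS, verify=False,
                       params={"output_mode": "json", "count": 0, "offset": 0})
```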
"With the 'auto_cancel' parameter what counts as 'inactivity'?"
I have no idea, but in my testing (setting auto_cancel to 2 in the POST request creating the job) it did not make any difference.
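My test was roughly along these lines (host and token are placeholders); the job was left completely untouched after creation, which I assumed would count as inactivity:

```
# Sketch of the auto_cancel test: create a job with auto_cancel=2, never touch
# it again, then check whether it has been cancelled. Placeholders throughout.
import time
import requests

BASE = "https://splunk.example.com:8089"
HEADERS = {"Authorization": "Bearer <token>"}

r = requests.post(f"{BASE}/services/search/jobs",
                  headers=HEADERS, verify=False,
                  data={"search": "search index=_internal | head 1000",
                        "auto_cancel": "2", "output_mode": "json"})
sid = r.json()["sid"]

time.sleep(30)  # no polling, no result fetches -- presumably "inactivity"

# If auto_cancel had kicked in, I would expect this GET to return 404.
check = requests.get(f"{BASE}/services/search/jobs/{sid}",
                     headers=HEADERS, verify=False,
                     params={"output_mode": "json"})
print(sid, check.status_code)
```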
"Would doing a DELETE /services/search/jobs/{sid} clear up space quicker?"
This worked the same way as sending a POST request to the search/jobs/{search_id}/control endpoint with action=cancel. In the Job Manager, the job disappeared pretty quickly after it finished executing. In both cases, Splunk returned this:
{"messages":[{"type":"INFO","text":"Search job cancelled."}]}
I suspect that the same code handles both requests.
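Both cancellation paths, side by side, look roughly like this (placeholder host, token, and sid):

```
# The two ways of getting rid of a job that I compared; both returned the
# "Search job cancelled." message in my tests. Placeholders throughout.
# Each path was tested against its own job; running both against the same sid
# is not meaningful, since the first call removes the job.
import requests

BASE = "https://splunk.example.com:8089"
HEADERS = {"Authorization": "Bearer <token>"}
sid = "<sid of an existing job>"

# Option 1: POST action=cancel to the job's control endpoint.
r1 = requests.post(f"{BASE}/services/search/jobs/{sid}/control",
                   headers=HEADERS, verify=False,
                   data={"action": "cancel", "output_mode": "json"})
print(r1.text)

# Option 2: DELETE the job resource directly.
r2 = requests.delete(f"{BASE}/services/search/jobs/{sid}",
                     headers=HEADERS, verify=False,
                     params={"output_mode": "json"})
print(r2.text)
```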
So, the good thing is that this seems to work, but my testing was simple and may not catch corner cases. For example, if the same search is executed again while the original search is still running, will Splunk delete the first job when it is done? The second search might be reading the first job's cached results, so the request to delete the first job might be rejected. And what happens to the delete/cancel request after the second search is done?