
stranjer · ‎02-03-2021

Disk Quota Limits, Search API Endpoint Differences and Parameters

Looking for more clarity and a deeper understanding so I can properly solve a recurring issue I'm seeing.

We have a script performing searches using the API. Currently, the flow works like this:

1. Start the search by doing POST /services/search/jobs with the search as a parameter, and get back the search id (sid).
2. Run a loop doing GET /services/search/jobs/{sid} to check the search job status until it is done.
3. Pull back the results of the search by doing GET /services/search/jobs/{sid}/results.
4. After the results are pulled back, do POST /services/search/jobs/{sid}/control to cancel the search and delete the result cache.
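To make the flow above concrete, here is a minimal Python sketch of it. The `send` callable is a hypothetical transport (method, path, parameters in, parsed JSON out) standing in for whatever HTTP client and authentication the script really uses, and the response shapes (`sid`, `entry[0].content.isDone`) assume `output_mode=json`. It is an illustration of the steps, not our actual script.

```python
import time
from typing import Callable, Dict, Optional

# Hypothetical transport: (method, path, params) -> parsed JSON response.
Transport = Callable[[str, str, Optional[Dict[str, str]]], dict]


def run_search(send: Transport, query: str, poll_seconds: float = 2.0) -> dict:
    """Create a search job, wait for it, fetch results, then cancel it."""
    # Step 1: create the job and read back the search id (sid).
    created = send("POST", "/services/search/jobs",
                   {"search": query, "output_mode": "json"})
    sid = created["sid"]
    try:
        # Step 2: poll the job status until Splunk reports it as done.
        while True:
            status = send("GET", f"/services/search/jobs/{sid}",
                          {"output_mode": "json"})
            if status["entry"][0]["content"]["isDone"]:
                break
            time.sleep(poll_seconds)
        # Step 3: pull the results (count=0 asks for all available rows).
        return send("GET", f"/services/search/jobs/{sid}/results",
                    {"output_mode": "json", "count": "0"})
    finally:
        # Step 4: cancel the job even if polling or fetching raised.
        send("POST", f"/services/search/jobs/{sid}/control",
             {"action": "cancel"})
```

Putting the cancel in a `finally` block means a failed poll or fetch does not leave the job's artifacts behind on disk.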

The script is designed to not try to run more than a few searches at a time, and to wait until earlier searches have been canceled to start the next search.
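Roughly, the concurrency limit looks like the sketch below. `run_one` is a hypothetical callable that runs one search end to end, including the cancel step, so a worker slot only frees up after the earlier job has been cancelled. The limit of 3 is an arbitrary example value.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def run_all(queries: Iterable[str],
            run_one: Callable[[str], T],
            max_concurrent: int = 3) -> List[T]:
    """Run searches with at most max_concurrent in flight at any time."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # map() keeps results in the same order as the input queries.
        return list(pool.map(run_one, queries))
```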

However, we are still sometimes hitting the search disk quota limit. We've increased this limit a few times, well past the default, which has reduced the frequency, but the issue still comes up. We do NOT want to change this to unlimited, nor keep increasing it every time it gets hit. There are a few questions I can't find documentation on while trying to figure out solutions:

Is there normally a delay after a search has been successfully cancelled before the results cache is removed?
- Or a delay until the disk usage quota is updated to reflect the cleared space?
Would doing a DELETE /services/search/jobs/{sid} clear up space quicker?
Would switching to using the /services/search/jobs/export endpoint help?
- If the results are streaming, do they also still persist on disk?
- The Python & Java SDK docs say export searches '...return results in a stream, rather than as a search job that is saved on the server.' But I'm not sure that means the result cache isn't saved.
Does setting a low 'timeout' value in the search/jobs parameter clear the disk space after that value has passed?
With the 'auto_cancel' parameter what counts as 'inactivity'?
- checking the status of the SID?
- retrieving results?
- If accidentally set to do 'search index=* ' for all time, does this stop it before completion? (I assume so, but wanted confirmation.)
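For reference, this is roughly how the two parameters in question would be passed when creating the job. It is only a sketch: the helper name `build_job_params` and the values 60 and 30 are made up for illustration, and it does not answer what Splunk treats as 'inactivity'.

```python
from typing import Dict
from urllib.parse import urlencode


def build_job_params(query: str,
                     timeout_s: int = 60,
                     auto_cancel_s: int = 30) -> Dict[str, str]:
    """Form fields for POST /services/search/jobs (values are examples)."""
    return {
        "search": query,
        "output_mode": "json",
        # 'timeout': seconds to keep the job after processing has stopped.
        "timeout": str(timeout_s),
        # 'auto_cancel': cancel after this many seconds of inactivity.
        "auto_cancel": str(auto_cancel_s),
    }


# The encoded string is what would go in the POST body.
body = urlencode(build_job_params("search index=main | head 10"))
```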

The documentation is unclear on what some phrases mean (like 'inactivity', or 'rather than saved on the server' in the SDK docs), and some other parts are likely simplifications or abstractions of concepts I need to understand in more depth (cancelling/deleting jobs, clearing disk space).

Trying to avoid just putting band-aids on a bullet wound, but need more details to determine the right treatment.

mleati · ‎10-08-2021

I am facing exactly the same issues. Our technique for getting the search results is also exactly the same. In the past we tried using export, the Splunk Python SDK, and a few other things. We had issues with all of them for searches that return a lot of data (millions of events, GBs of data). Have you found a solution?

mleati · ‎10-08-2021

Here are my findings:

"Would switching to using the /services/search/jobs/export endpoint help"?

From the documentation for search/jobs/export endpoint:

If it is too big, you might instead run with the search/jobs (not search/jobs/export) endpoint (it takes POST with the same parameters), maybe using the exec_mode=blocking.

So the export endpoint does not appear to be an option for searches returning a lot of data.
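For anyone who still wants to try export on smaller searches, here is a minimal sketch of reading the stream. It assumes `output_mode=json`, where (as far as I can tell) each line of the response is a separate JSON object with a `preview` flag and a `result` field. That shape is an assumption worth checking against your own instance, and `iter_export_results` is just an illustrative helper name.

```python
import json
from typing import Iterable, Iterator


def iter_export_results(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield final result rows from a line-delimited JSON export stream."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        row = json.loads(raw)
        if row.get("preview"):
            continue  # skip partial (preview) rows
        if "result" in row:
            yield row["result"]
```

An open `urllib.request.urlopen(...)` response can be passed in directly as `lines`, since it iterates line by line, so rows are handled as they arrive instead of being loaded all at once.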

"With the 'auto_cancel' parameter what counts as 'inactivity'?"

I have no idea, but in my testing (setting auto_cancel to 2 in the POST request that creates the job) it did not make any difference.

"Would doing a DELETE /services/search/jobs/{sid} clear up space quicker?"

This worked the same way as using POST request on search/jobs/{search_id}/control endpoint (with action=cancel). In Job Manager, the job disappeared after the job execution pretty quickly. In both cases, Splunk returned this:

{"messages":[{"type":"INFO","text":"Search job cancelled."}]}

I suspect that the same code handles both requests.
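For comparison, the two requests I tested look roughly like this. It is a sketch only: `BASE_URL` assumes the default management port on localhost, the function names are illustrative, and authentication headers are left out.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://localhost:8089"  # assumption: default management port


def cancel_via_control(sid: str) -> urllib.request.Request:
    """POST .../control with action=cancel."""
    safe_sid = urllib.parse.quote(sid, safe="")
    body = urllib.parse.urlencode(
        {"action": "cancel", "output_mode": "json"}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/services/search/jobs/{safe_sid}/control",
        data=body, method="POST")


def cancel_via_delete(sid: str) -> urllib.request.Request:
    """DELETE on the job resource itself."""
    safe_sid = urllib.parse.quote(sid, safe="")
    return urllib.request.Request(
        f"{BASE_URL}/services/search/jobs/{safe_sid}?output_mode=json",
        method="DELETE")
```

Either request object can then be sent with `urllib.request.urlopen` once an Authorization header has been added.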

So, the good thing is that this seems to work, but my testing was simple and may not catch corner cases. For example, if the same search is executed again while the original search is still running, will Splunk delete the first job when it is done? The second search might be reading cached results, so the request to delete the first job might be rejected. And what will happen to the delete/cancel request after the second search is done?
