Splunk Dev

Disk Quota Limits, Search API Endpoint Differences and Parameters

stranjer
Loves-to-Learn Lots

Looking for clarity and a deeper understanding so I can properly solve a recurring issue I'm seeing.

We have a script performing searches using the REST API. Currently, the flow works like this (a rough sketch in code follows the list):

  • Start the search with POST /services/search/jobs, passing the search as a parameter, and get back a search ID (sid).
  • Run a loop doing GET /services/search/jobs/{sid} to check the search job status until it is done.
  • Pull back the results of the search with GET /services/search/jobs/{sid}/results.
  • After the results are pulled back, do a POST /services/search/jobs/{sid}/control to cancel the search and delete the result cache.
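
A rough sketch of that flow in Python with the requests library (the host, port, token, and query string are placeholders for our actual connection details):

    import time
    import requests

    BASE = "https://splunk.example.com:8089"       # placeholder management host/port
    HEADERS = {"Authorization": "Bearer <token>"}  # bearer token auth; basic auth or a session key also works
    VERIFY = False                                 # self-signed certs are common on the management port

    # 1. Start the search and capture the sid.
    resp = requests.post(
        f"{BASE}/services/search/jobs",
        headers=HEADERS,
        data={"search": "search index=main | head 1000", "output_mode": "json"},
        verify=VERIFY,
    )
    sid = resp.json()["sid"]

    # 2. Poll the job until its dispatch state reports it is finished.
    while True:
        content = requests.get(
            f"{BASE}/services/search/jobs/{sid}",
            headers=HEADERS,
            params={"output_mode": "json"},
            verify=VERIFY,
        ).json()["entry"][0]["content"]
        if content["dispatchState"] in ("DONE", "FAILED"):
            break
        time.sleep(2)

    # 3. Pull back the results (count=0 returns all of them).
    results = requests.get(
        f"{BASE}/services/search/jobs/{sid}/results",
        headers=HEADERS,
        params={"output_mode": "json", "count": 0},
        verify=VERIFY,
    ).json()

    # 4. Cancel the job so its dispatch directory / result cache can be removed.
    requests.post(
        f"{BASE}/services/search/jobs/{sid}/control",
        headers=HEADERS,
        data={"action": "cancel"},
        verify=VERIFY,
    )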

The script is designed to run no more than a few searches at a time and to wait until earlier searches have been cancelled before starting the next one.

However, we are still sometimes hitting the search disk quota limit. We've increased this limit a few times, well past the default, which has reduced the frequency, but the issue still comes up. We do NOT want to change it to unlimited, nor keep increasing it every time it gets hit. There are a few questions I'm not able to find documentation on while trying to figure out a solution:

  • Is there normally a delay after a search has been successfully cancelled before the results cache is removed?
    • Or a delay until the disk usage quota is updated to reflect the cleared space?
  • Would doing a DELETE /services/search/jobs/{sid} clear up space quicker?
  • Would switching to using the /services/search/jobs/export endpoint help?
    • If the results are streamed, do they also persist on disk?
    • The Python & Java SDK docs say export searches '...return results in a stream, rather than as a search job that is saved on the server.' But I'm not sure that means the result cache isn't saved.
  • Does setting a low 'timeout' value in the search/jobs parameters clear the disk space after that value has passed? (See the parameter sketch after this list.)
  • With the 'auto_cancel' parameter, what counts as 'inactivity'?
    • Checking the status of the SID?
    • Retrieving results?
    • If accidentally set on a 'search index=*' for all time, does this stop it before completion? (I assume so, but wanted confirmation.)
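
For reference, both of these are dispatch-time parameters passed when the job is created; a minimal sketch of what I mean (same placeholder connection details as the sketch above; whether they actually free the disk space sooner is exactly what I'm trying to confirm):

    # Same BASE / HEADERS / VERIFY placeholders as the earlier sketch.
    resp = requests.post(
        f"{BASE}/services/search/jobs",
        headers=HEADERS,
        data={
            "search": "search index=main | head 1000",
            "output_mode": "json",
            # Seconds to keep the job after processing stops (the REST docs
            # give the default as 86400).
            "timeout": 120,
            # Cancel the job after this many seconds of "inactivity" -- what
            # counts as activity here is the question above.
            "auto_cancel": 30,
        },
        verify=VERIFY,
    )
    sid = resp.json()["sid"]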

The documentation is unclear on what some phrases mean (like 'inactivity', or 'rather than saved on the server' in the SDK docs), and some other parts are likely simplifications/abstractions of concepts I need to understand in more depth (cancelling/deleting jobs, clearing disk space).

Trying to avoid just putting band-aids on a bullet wound, but need more details to determine the right treatment.

mleati
Explorer

I am facing exactly the same issues. Our technique for getting the search results is also exactly the same. In the past we tried using export, the Splunk Python SDK, and a few other things. We had issues with all of them for searches that return a lot of data (millions of events, GBs of data). Have you found a solution?

mleati
Explorer

Here are my findings:

"Would switching to using the /services/search/jobs/export endpoint help"?

From the documentation for the search/jobs/export endpoint:

If it is too big, you might instead run with the search/jobs (not search/jobs/export) endpoint (it takes POST with the same parameters), maybe using the exec_mode=blocking. 

So, using the export endpoint does not appear to be an option for searches returning a lot of data.
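
Following that suggestion, a blocking dispatch would look roughly like this (placeholder connection details; note the job and its results still live in a dispatch directory on the server until they are cancelled or expire):

    import requests

    BASE = "https://splunk.example.com:8089"       # placeholder
    HEADERS = {"Authorization": "Bearer <token>"}  # placeholder

    # exec_mode=blocking: the POST does not return until the search finishes,
    # so no status-polling loop is needed.
    resp = requests.post(
        f"{BASE}/services/search/jobs",
        headers=HEADERS,
        data={
            "search": "search index=main | stats count by sourcetype",
            "exec_mode": "blocking",
            "output_mode": "json",
        },
        verify=False,
    )
    sid = resp.json()["sid"]

    # Results are then fetched from the finished job as usual.
    results = requests.get(
        f"{BASE}/services/search/jobs/{sid}/results",
        headers=HEADERS,
        params={"output_mode": "json", "count": 0},
        verify=False,
    ).json()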

"With the 'auto_cancel' parameter what counts as 'inactivity'?"

I have no idea, but in my testing (setting auto_cancel to 2 in the POST request creating the job) it did not make any difference.

"Would doing a DELETE /services/search/jobs/{sid} clear up space quicker?"

This worked the same way as using a POST request on the search/jobs/{search_id}/control endpoint (with action=cancel). In Job Manager, the job disappeared pretty quickly after the job finished. In both cases, Splunk returned this:

{"messages":[{"type":"INFO","text":"Search job cancelled."}]}

I suspect that the same code handles both requests.
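
For reference, the two requests look roughly like this (placeholder connection details; in my tests both came back with the INFO message above):

    import requests

    BASE = "https://splunk.example.com:8089"       # placeholder
    HEADERS = {"Authorization": "Bearer <token>"}  # placeholder
    sid = "<sid of a previously dispatched job>"

    # Option 1: cancel through the control endpoint.
    requests.post(
        f"{BASE}/services/search/jobs/{sid}/control",
        headers=HEADERS,
        data={"action": "cancel"},
        verify=False,
    )

    # Option 2: DELETE the job resource directly.
    requests.delete(
        f"{BASE}/services/search/jobs/{sid}",
        headers=HEADERS,
        verify=False,
    )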

So, the good thing is that this seems to work, but my testing was simple and may not catch corner cases. For example, if the same search is executed again while the original search is still running, will Splunk delete the first job when it is done? The second search might be reading cached results, so the request to delete the first job might be rejected. And what will happen to the delete/cancel request after the second search is done?
