Hi,
I am trying to export (stream) huge search results by using the REST API directly from Python. One minute of data is about 600,000 events. For a 10-minute range I am able to get the data, but when I increase the range beyond 10 minutes, the search auto-finalizes. (On the Jobs page my search is not available in the UI, but its dispatch status is "finalizing".)
My export search is something like:
index=somename sourcetype=somename earliest=-20m | table _indextime, _raw
Is there any setting that prevents even the export API from streaming all results?
For large jobs you'd be better off creating a search "traditionally" by POSTing to search/jobs instead of search/jobs/export, retrieving the sid, and then loading results off that sid. See this snippet from the docs:
If it is too big, you might instead run with the search/jobs (not search/jobs/export) endpoint (it takes POST with the same parameters), maybe using the exec_mode=blocking. You'll then get back a search id, and then you can page through the results and request them from the server under your control, which is a better approach for extremely large result sets that need to be chunked.
http://docs.splunk.com/Documentation/Splunk/6.2.3/RESTREF/RESTsearch#search.2Fjobs.2Fexport
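In Python, the create-then-page approach could look roughly like the sketch below. It is a minimal sketch, not a tested client: the host, port and credentials are assumptions copied from the curl example in this thread, the helper names (_call, create_job, fetch_page, iter_results) are made up for illustration, and the page size is an arbitrary choice that the server may cap at a lower value.

```python
import base64
import json
import ssl
import urllib.parse
import urllib.request

BASE = "https://localhost:8089"   # assumption: management port from the curl example
AUTH = ("admin", "changeme")      # assumption: credentials from the curl example


def _call(path, params, post=False):
    """Send one authenticated request to splunkd and decode the JSON reply."""
    query = urllib.parse.urlencode(params)
    url = BASE + path if post else BASE + path + "?" + query
    req = urllib.request.Request(url, data=query.encode() if post else None)
    token = base64.b64encode(":".join(AUTH).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    # Equivalent of curl -k: do not verify the (often self-signed) certificate.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.load(resp)


def create_job(search):
    """POST to search/jobs in blocking mode and return the search id (sid)."""
    reply = _call("/services/search/jobs",
                  {"search": search, "exec_mode": "blocking", "output_mode": "json"},
                  post=True)
    return reply["sid"]


def fetch_page(sid, offset, count):
    """GET one page of results for an existing job."""
    reply = _call("/services/search/jobs/%s/results" % sid,
                  {"offset": offset, "count": count, "output_mode": "json"})
    return reply.get("results", [])


def iter_results(fetch, page_size=50000):
    """Yield rows page by page; stop when the server returns an empty page."""
    offset = 0
    while True:
        page = fetch(offset, page_size)
        if not page:
            break
        yield from page
        # Advance by what was actually returned, in case the server caps count.
        offset += len(page)


# Usage (requires a reachable Splunk instance):
#   sid = create_job("search index=somename sourcetype=somename earliest=-20m | table _indextime, _raw")
#   for row in iter_results(lambda off, n: fetch_page(sid, off, n)):
#       print(row["_raw"])
```

Note that the search string passed over REST starts with the search command, as in the curl example in this thread. Advancing the offset by the number of rows actually returned keeps the loop correct even if the server hands back fewer rows per request than were asked for.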
You may get much better speeds if you set output_mode=raw:
$ curl -k -u admin:changeme https://localhost:8089/services/search/jobs/export -d search="search index=_internal" -d output_mode=raw > outfile
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  686M    0  686M    0    45  13.1M      0 --:--:--  0:00:52 --:--:-- 12.0M
$ cat outfile | wc -l
4007497
Four million events, 700MB, 52 seconds, run on my home all-in-one Splunk instance.
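Since the original question was about doing this from Python, here is a rough equivalent of the curl call above using only the standard library. Treat it as a sketch under assumptions: the URL and credentials are the ones from the curl example, export_raw and copy_stream are hypothetical helper names, and the 1 MiB chunk size is an arbitrary choice. The response is written to disk chunk by chunk, so the full result set is never held in memory.

```python
import base64
import ssl
import urllib.parse
import urllib.request


def copy_stream(src, dst, chunk_size=1 << 20):
    """Copy src to dst in fixed-size chunks; return the number of bytes written."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total


def export_raw(search, out_path, base="https://localhost:8089",
               user="admin", password="changeme"):
    """POST to search/jobs/export with output_mode=raw and stream the body to a file."""
    body = urllib.parse.urlencode({"search": search, "output_mode": "raw"}).encode()
    req = urllib.request.Request(base + "/services/search/jobs/export", data=body)
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    # Equivalent of curl -k: do not verify the (often self-signed) certificate.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, context=ctx) as resp, open(out_path, "wb") as out:
        return copy_stream(resp, out)


# Usage (requires a reachable Splunk instance):
#   written = export_raw("search index=_internal", "outfile")
#   print(written, "bytes written")
```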
Thanks @martin_mueller. I will try that out.
| table * is a terrible idea because it tells Splunk to extract ALL the fields. Consider | table _raw instead if that's all you're looking to export.
Here's what the docs recommend on exporting large volumes: http://docs.splunk.com/Documentation/Splunk/6.2.3/Search/Exportsearchresults#Python_SDK
@martin_mueller I also tried POSTing to /search/jobs. For a large result set (more than 10 million events), this endpoint does not give me more than 500,009 results (I don't know the reason for this number). When I append | table * to my query, I do get all results, but they took more than 1 hour to stream back from the Splunk machine to my remote system. That is too long to be practical for my use case.
@martin_mueller I tried this, and the only issue is that streaming through the SDK takes a performance hit in my use case. Exporting or searching directly through the REST API is much faster than using the SDK.