Re: Using the REST API in Python to export large s...

karan1337 · ‎07-05-2015

Hi,

I am trying to export (stream) very large search results by calling the REST API directly from Python. One minute of data yields about 600,000 events. I can retrieve up to 10 minutes of data, but when I extend the time range beyond 10 minutes, the search auto-finalizes. (On the Jobs page, my search is not available in the UI, but its dispatch status is "finalizing".)

My export search is something like:

index=somename sourcetype=somename earliest=-20m | table _indextime, _raw

Is there any setting that restricts even the export API from streaming all results?

martin_mueller · ‎07-05-2015

For large jobs you'd be better off creating a search "traditionally" by POSTing to search/jobs instead of search/jobs/export, retrieving the sid, and then loading results off that sid. See this snippet from the docs:

If it is too big, you might instead run with the search/jobs (not search/jobs/export) endpoint (it takes POST with the same parameters), maybe using the exec_mode=blocking. You'll then get back a search id, and then you can page through the results and request them from the server under your control, which is a better approach for extremely large result sets that need to be chunked.

http://docs.splunk.com/Documentation/Splunk/6.2.3/RESTREF/RESTsearch#search.2Fjobs.2Fexport
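For anyone wanting to try the approach from that docs snippet in plain Python, here is a rough sketch using only the standard library. The host and port are taken from the curl examples in this thread, the page size of 10000 is an arbitrary choice, and the helper names (`create_blocking_job`, `fetch_page`, `iter_results`) are made up for illustration. It assumes the server accepts basic auth and returns JSON when `output_mode=json` is set. Treat it as a starting point, not a tested client.

```python
import base64
import json
import ssl
import urllib.parse
import urllib.request

BASE_URL = "https://localhost:8089"  # assumption: management port, as in the curl examples


def _request(url, auth, data=None):
    """Send a GET (data=None) or a form-encoded POST and return the parsed JSON body."""
    body = urllib.parse.urlencode(data).encode() if data is not None else None
    req = urllib.request.Request(url, data=body)
    token = base64.b64encode("{}:{}".format(*auth).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    # Rough equivalent of curl -k: do not verify the server certificate.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.load(resp)


def create_blocking_job(search, auth):
    """POST to search/jobs with exec_mode=blocking and return the sid."""
    reply = _request(
        BASE_URL + "/services/search/jobs",
        auth,
        {"search": search, "exec_mode": "blocking", "output_mode": "json"},
    )
    return reply["sid"]


def fetch_page(sid, auth, offset, count):
    """GET one page of results for a finished job."""
    query = urllib.parse.urlencode(
        {"output_mode": "json", "offset": offset, "count": count}
    )
    url = "{}/services/search/jobs/{}/results?{}".format(
        BASE_URL, urllib.parse.quote(sid, safe=""), query
    )
    return _request(url, auth).get("results", [])


def iter_results(get_page, page_size=10000):
    """Yield rows page by page until get_page(offset, count) returns an empty page."""
    offset = 0
    while True:
        page = get_page(offset, page_size)
        if not page:
            break
        yield from page
        offset += len(page)


# Example wiring (not executed here; credentials are the ones from the curl example):
# auth = ("admin", "changeme")
# sid = create_blocking_job("search index=somename sourcetype=somename earliest=-20m | table _indextime, _raw", auth)
# for row in iter_results(lambda off, n: fetch_page(sid, auth, off, n)):
#     print(row)
```

Note the search string starts with the `search` command, as in the curl example. Keeping `iter_results` separate from the HTTP code means the paging loop can be checked without a Splunk instance.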

martin_mueller · ‎07-06-2015

You may get much better speeds if you set output_mode=raw:

$ curl -k -u admin:changeme https://localhost:8089/services/search/jobs/export -d search="search index=_internal" -d output_mode=raw > outfile
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  686M    0  686M    0    45  13.1M      0 --:--:--  0:00:52 --:--:-- 12.0M
$ cat outfile | wc -l
4007497

Four million events, 700MB, 52 seconds, run on my home all-in-one Splunk instance.
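Since the question was about Python rather than curl, a rough standard-library equivalent of the command above might look like the following. The function names and the 1 MiB chunk size are my own choices, and disabling certificate checks mirrors curl's `-k`, which you probably only want against a test instance. I have not benchmarked this against curl, so the speed may differ.

```python
import base64
import ssl
import urllib.parse
import urllib.request

# Assumption: same host, port and endpoint as the curl example.
EXPORT_URL = "https://localhost:8089/services/search/jobs/export"


def copy_stream(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst in fixed-size chunks and return the number of bytes written."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total


def export_raw(search, username, password, out_path):
    """POST the search to the export endpoint with output_mode=raw and stream it to a file."""
    body = urllib.parse.urlencode({"search": search, "output_mode": "raw"}).encode()
    req = urllib.request.Request(EXPORT_URL, data=body)
    token = base64.b64encode("{}:{}".format(username, password).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    # Rough equivalent of curl -k: do not verify the server certificate.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, context=ctx) as resp, open(out_path, "wb") as out:
        return copy_stream(resp, out)


# Example call (not executed here):
# export_raw("search index=_internal", "admin", "changeme", "outfile")
```

Reading in chunks keeps memory use flat no matter how large the export is, which is the point of streaming to a file instead of loading the whole response.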

karan1337 · ‎07-07-2015

Thanks @martin_mueller. I will try that out.

martin_mueller · ‎07-06-2015

| table * is a terrible idea because it tells Splunk to extract ALL the fields. Consider | table _raw instead if that's all you're looking to export.

martin_mueller · ‎07-05-2015

Here's what the docs recommend on exporting large volumes: http://docs.splunk.com/Documentation/Splunk/6.2.3/Search/Exportsearchresults#Python_SDK

karan1337 · ‎07-05-2015

@martin_mueller I also tried POSTing to /search/jobs. For a large result set (more than 10 million), this endpoint does not give me more than 500,009 results (I don't know the reason for this number). When I append | table * to my query, I do get all the results, but they took more than 1 hour to stream back to my remote system from the Splunk machine. Such a long time might not be practical for my use case.

karan1337 · ‎07-05-2015

@martin_mueller I tried this, and the only issue was that streaming through the SDK hurts performance in my use case. Exporting or searching directly through the REST API is much faster than using the SDK.

Using the REST API in Python to export large search results, why does the search auto finalize?
