Hello Splunk community,
I'm using the current Java SDK (1.5.0) to programmatically export large amounts of data from Splunk.
I'm doing this by submitting an export job:
JobExportArgs exportArgs = new JobExportArgs();
exportArgs.setEarliestTime(new DateTime(startMsEpoch).toString());
exportArgs.setLatestTime(new DateTime(endMsEpoch).toString());
exportArgs.setOutputMode(JobExportArgs.OutputMode.XML);
exportArgs.setSearchMode(JobExportArgs.SearchMode.NORMAL);
exportJobStream = splunkService.export("search " + searchFilters, exportArgs);
(Note that the code samples here are not complete; I haven't yet created the minimum code required to reproduce the problem.)
Then I create a multi-results reader and a results iterator:
splunkResultsReader = new MultiResultsReaderXml(exportJobStream);
splunkResultIterator = splunkResultsReader.iterator();
After this, I essentially check whether splunkResultIterator.hasNext() is true, and if so, I grab splunkEventIterator = splunkResultIterator.next().iterator().
While splunkEventIterator.hasNext() is true, I grab and process Splunk events. When that iterator is exhausted, I move on to the next value in splunkResultIterator via .hasNext() and .next(). When splunkResultIterator has no more values, I consider the search complete and all data extracted.
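The nested iteration described above can be sketched roughly as follows. This is only an illustration under stated assumptions: plain Java collections stand in for the SDK's reader and result-set types so the snippet runs on its own, and NestedResultWalker and collectField are hypothetical names that are not part of the Splunk SDK.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the Splunk SDK. Each inner Iterable stands in
// for one result set, and each Map stands in for one event (field name -> value).
final class NestedResultWalker {

    // Walks every result set, then every event inside it, and collects the
    // value of one field (for example "event_id") in the order encountered.
    static List<String> collectField(
            Iterable<? extends Iterable<? extends Map<String, String>>> resultSets,
            String fieldName) {
        List<String> values = new ArrayList<>();
        Iterator<? extends Iterable<? extends Map<String, String>>> resultIterator =
                resultSets.iterator();
        while (resultIterator.hasNext()) {
            // Outer loop: one pass per result set.
            Iterator<? extends Map<String, String>> eventIterator =
                    resultIterator.next().iterator();
            while (eventIterator.hasNext()) {
                // Inner loop: one pass per event in the current result set.
                String value = eventIterator.next().get(fieldName);
                if (value != null) {
                    values.add(value);
                }
            }
        }
        // Both iterators are exhausted: every event has been visited once.
        return values;
    }
}
```

Note that a loop of this shape processes every result set it is handed, so if the same event shows up in more than one result set, it is collected more than once.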
This has mostly worked fine for us, but when a search returns more than ~250K records, we start getting ~10K more records than the Splunk UI reports. For a search of ~660K records, I get ~25K more than expected.
If I save the event_id values of the exported events to a file and inspect it, I find that the extra ~10K records are all duplicates. When I run the same search query in the Splunk UI with | stats count by event_id appended, I never see duplicates, so I'm a little confused.
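For reference, a duplicate check of the kind described above could look something like the sketch below. It assumes the exported event_id values have already been collected into a list; DuplicateIdReport, findDuplicates, and surplus are hypothetical names used only for illustration.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for inspecting exported IDs; not part of the Splunk SDK.
final class DuplicateIdReport {

    // Returns every ID that occurs more than once, mapped to its occurrence
    // count, in order of first appearance.
    static Map<String, Integer> findDuplicates(List<String> eventIds) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String id : eventIds) {
            counts.merge(id, 1, Integer::sum);
        }
        // Drop IDs seen exactly once so only the duplicates remain.
        counts.values().removeIf(count -> count == 1);
        return counts;
    }

    // Number of surplus records: total records minus distinct IDs.
    static int surplus(List<String> eventIds) {
        return eventIds.size() - new HashSet<>(eventIds).size();
    }
}
```

If the surplus equals the difference between the exported count and the count shown in the UI, the extra records are fully explained by repeated event_id values.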
I'm looking for any suggestions or insight into the problem. I will continue to test, and I will try to set up a minimal code sample that reproduces it, rather than taking snippets out of our code base.
--Wes