Hello Splunk community,
I'm using the current Java SDK (1.5.0) to programmatically export large amounts of data from Splunk.
I'm doing this by submitting an export job;
JobExportArgs exportArgs = new JobExportArgs();
exportArgs.setEarliestTime(new DateTime(startMsEpoch).toString());
exportArgs.setLatestTime(new DateTime(endMsEpoch).toString());
exportArgs.setOutputMode(JobExportArgs.OutputMode.XML);
exportArgs.setSearchMode(JobExportArgs.SearchMode.NORMAL);
exportJobStream = splunkService.export("search " + searchFilters, exportArgs);
(Note that the code samples here are not 100% complete; I haven't yet created the minimum code required to reproduce the problem.)
Then I create a multi-results reader, and a results iterator;
splunkResultsReader = new MultiResultsReaderXml(exportJobStream);
splunkResultIterator = splunkResultsReader.iterator();
After this, I essentially check if splunkResultIterator.hasNext(), and if so, I grab splunkEventIterator = splunkResultIterator.next().iterator(). While splunkEventIterator.hasNext(), I grab and process Splunk events. When that iterator no longer hasNext(), I move on to the next value in the splunkResultIterator via .hasNext() and .next(). When that iterator no longer has values, I consider the search complete, and all data extracted.
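For reference, the nested loop described above can be sketched roughly as follows. This is only an illustration of the control flow, not the code from our code base: it uses plain Java collections as stand-ins, and the names NestedExportLoop and drain are made up for this sketch. With the SDK, the outer iterator would come from the MultiResultsReaderXml and the inner elements would be com.splunk.Event objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: models the outer/inner iterator walk with plain Java types.
class NestedExportLoop {

    /**
     * Walks every result set, then every event inside it, and returns the
     * number of events handed to the handler.
     */
    static <E> long drain(Iterator<? extends Iterable<E>> resultSets, Consumer<E> handler) {
        long processed = 0;
        while (resultSets.hasNext()) {
            // One "result set" (stand-in for one element of the multi-results reader).
            Iterator<E> events = resultSets.next().iterator();
            while (events.hasNext()) {
                handler.accept(events.next());
                processed++;
            }
        }
        // Outer iterator exhausted: the export is treated as complete.
        return processed;
    }

    public static void main(String[] args) {
        // Made-up data standing in for two result sets from an export stream.
        List<List<String>> fakeResultSets = Arrays.asList(
                Arrays.asList("event-1", "event-2"),
                Arrays.asList("event-3"));
        List<String> seen = new ArrayList<>();
        long count = drain(fakeResultSets.iterator(), seen::add);
        System.out.println(count + " events processed: " + seen);
    }
}
```

Note that a loop like this counts every element of every result set, so if the stream ever delivered the same event in more than one set, it would be processed more than once.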
This has mostly worked fine for us, but when we search for more than ~250K records, we start getting ~10K more records than the Splunk UI returns. For ~660K records, I get ~25K more than expected.
If I, say, save the event_id values for those events to a file and take a look at it, I find that those extra 10K records are all duplicates. I can also run the search query in the Splunk UI along with | stats count by event_id and I never see duplicates, so I'm a little confused.
I'm looking for any suggestions and/or insight into the problem. I will continue to test, as well as try to set up a minimum set of code to reproduce, rather than taking snippets out of our code base.
--Wes
I have written some code that reproduces the issue for me, and uploaded it to github, here;
https://github.com/FileTrek/SplunkJavaSdkDuplicateRepoduction
The README outlines what needs to be changed to have the tests work in your own environments.
Recently found/fixed a similar issue with the Python SDK, related to previews being mixed in with the results:
https://answers.splunk.com/answers/391949/unreliable-export-using-python-sdk.html#answer-391220
I don't know offhand how to account for this using the Java SDK - take a look at http://docs.splunk.com/Documentation/JavaSDK if you haven't already - maybe set JobExportArgs.OutputMode to RAW?
OK. I re-ran my tests using JobExportArgs.OutputMode.JSON, but read the data manually, rather than with a MultiResultsReaderJson, and parsed each line with Jackson.
Each record did have a "preview":... value. However, it was always false, so previews are not causing duplicates, it seems.
Thanks for the help though.
You know what, maybe there's something to the preview stuff here. The MultiResultsReaders may be removing any 'preview' part by the time it gets to me.
According to this; http://dev.splunk.com/view/java-sdk/SP-CAAAEPZ#export
A reporting (transforming) search returns a set of previews followed by the final events, each as separate elements.
A non-reporting (non-transforming) search returns events as they are read from the index, each as separate elements.
A real-time search returns multiple sets of previews, each preview as a separate element.
But through the SDK, I'm only allowed to choose REALTIME or NORMAL. Maybe NORMAL can return some preview events, depending on whether the search is reporting or non-reporting?
Good thoughts, and thanks for the advice, but sadly doesn't seem to help my case =(
Apparently, only JobExportArgs.SearchMode.REALTIME may return previews, whereas JobExportArgs.SearchMode.NORMAL shouldn't (as far as I can tell from reading http://dev.splunk.com/view/java-sdk/SP-CAAAEHQ).
I also tracked both the original and the duplicated Splunk Events in debug, and neither seems to be marked as 'preview', at least as far as com.splunk.Event objects can tell...
I can't try JobExportArgs.OutputMode.RAW AFAIK, since there isn't a MultiResultsReaderRaw? I'll poke around, maybe there's a way to parse RAW output other than the MultiResultsReaders...
Thanks again, gives me more to think about, and poke around with. 😃
It may be worth noting that the entire data set is >10M records, but with the search query we're using, the results are in the hundreds of thousands range.
If you just run the same search via curl (export and everything) and dump it all into a file (until it completes), do you also see the duplicates there? My main question is whether we're seeing duplicates that the SDK is generating due to some error or whether Splunk is returning unexpected results.
By the way, is event_id something in your data, or something you are seeing Splunk return?
Finally, keep in mind that it's not necessarily correct to compare what you see in the UI vs what you see from the SDK - the UI passes in quite a few different parameters that the SDK does not by default (for a variety of reasons), and the UI is not running an export search.
1)
Here's the UI search I use right now;
index="wineventlog" source="WinEventLog:Security" (EventCode=4624 OR EventCode=4625 OR EventCode=4648 OR EventCode=4634 OR EventCode=4647 OR EventCode=4768 OR EventCode=4769 OR EventCode=4770 OR EventCode=4771 OR EventCode=4772 OR EventCode=4773 OR EventCode=4776 OR EventCode=4777) | fields dvc,event_id | fields - _raw
If I add | stats count, I get 377,173 records. If I do | stats count by event_id I don't ever see a count over 1 (no duplicates on the event_id key).
If I do a cURL export like this
curl -k -u <username>:<password> -d "output_mode=csv" -o /data/splunkCurlExport/results.csv https://<server>:8089/servicesNS/<username>/search/search/jobs/export --data-urlencode 'search=search earliest="04/01/2016:00:00:00" latest="04/09/2016:00:00:00" index="wineventlog" source="WinEventLog:Security" (EventCode=4624 OR EventCode=4625 OR EventCode=4648 OR EventCode=4634 OR EventCode=4647 OR EventCode=4768 OR EventCode=4769 OR EventCode=4770 OR EventCode=4771 OR EventCode=4772 OR EventCode=4773 OR EventCode=4776 OR EventCode=4777) | table _time,dvc,event_id'
I get a CSV file with 377,173 records. Performing some quick stats with cat /data/splunkCurlExport/results.csv | egrep -o "[A-Z0-9]+,[0-9]+" | sort | uniq -c, I see no duplicates in this data, and get 377,173 records.
The Java SDK export job uses the same search as the above, and currently produces 382,918 records in total (for me at least). When tracking duplicates (based on the dvc and event_id keys), there are 5,745 duplicates (382,918 - 5,745 = 377,173).
2) event_id is a key specific to Windows Event Logs. I'm actually using dvc + event_id as a unique key for a Windows Event Log entry (with the UI search, I've just been using event_id right now).
3) As far as comparing data between the UI and the SDK, fair enough. Most importantly, I'm trying to figure out why the results are different, and why the results seem to contain duplicate records.
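To make the duplicate tracking from 1) and 2) concrete, here is a minimal sketch of counting repeats on the dvc + event_id key. The class name DuplicateTracker and the sample values are made up for illustration; the real code would feed in the dvc and event_id fields read from each exported event.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: tracks how many exported events repeat a (dvc, event_id) pair.
class DuplicateTracker {
    // A two-element list is used as the composite key so that different
    // (dvc, event_id) pairs can never be confused by string concatenation.
    private final Set<List<String>> seenKeys = new HashSet<>();
    private long total = 0;
    private long duplicates = 0;

    /** Records one event; returns true if this dvc + event_id pair was already seen. */
    boolean record(String dvc, String eventId) {
        total++;
        boolean firstTime = seenKeys.add(Arrays.asList(dvc, eventId));
        if (!firstTime) {
            duplicates++;
        }
        return !firstTime;
    }

    long total() { return total; }

    long duplicates() { return duplicates; }

    /** Number of distinct keys, i.e. total minus duplicates. */
    long unique() { return total - duplicates; }

    public static void main(String[] args) {
        DuplicateTracker tracker = new DuplicateTracker();
        // Made-up sample values, not real export data.
        tracker.record("HOST-A", "100");
        tracker.record("HOST-B", "100"); // same event_id, different dvc: not a duplicate
        tracker.record("HOST-A", "100"); // repeat of the first pair
        System.out.println("total=" + tracker.total()
                + " duplicates=" + tracker.duplicates()
                + " unique=" + tracker.unique());
    }
}
```

Applied to the numbers above, total() would correspond to the 382,918 records from the SDK export, duplicates() to the 5,745 repeats, and unique() to the 377,173 records that the UI and the cURL CSV export agree on.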
Thanks Wes - this is useful information. Can you try one more thing, which is to do the export as XML and not as CSV from curl? I imagine it'll be the same, but I just want to double check.
We have no known issue with duplicates in the SDK, but that doesn't mean there isn't one. I tried to reproduce this on a large-ish dataset (>1M records), but no luck, with similar code.
I'm getting weird results with the XML export. There seem to be previews mixed in? I don't have an easy way to filter out previews in the XML export at the moment, but it seems like ~168K records came back, of which ~50K might be non-preview?
For the Java SDK, I typically use JSON (I was trying XML as well when I wrote this question). So I also ran a cURL export to json.
That job returns 618,701 results, but I can more easily grep for "preview:false", and I see 377,173 non-preview records.
I'm touching up the minimum-code-to-reproduce and will publish it to GitHub. I use JSON export mode for those tests, as well as the MultiResultsReaderJson. More importantly, I also have tests that read the export stream directly and parse the JSON with Jackson rather than going through the MultiResultsReader. In that case preview is always false for those records, but more records than expected are always returned (~382K vs ~377K expected)...
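As a rough illustration of the preview check on a line-per-record JSON export, the sketch below counts lines by their "preview" flag. It is a stand-in, not the test code: it uses a regular expression from the standard library instead of Jackson, the class name PreviewLineCounter is made up, and the sample lines are an assumed shape for illustration only. A regex only finds the first "preview" flag on a line, so a real JSON parser is the safer choice for actual data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: tallies exported JSON lines by their "preview" flag.
class PreviewLineCounter {
    // Matches the flag as it appears in a line, e.g. "preview":false
    private static final Pattern PREVIEW =
            Pattern.compile("\"preview\"\\s*:\\s*(true|false)");

    /** Returns {previewLines, nonPreviewLines, linesWithoutFlag}. */
    static long[] count(Iterable<String> lines) {
        long preview = 0;
        long nonPreview = 0;
        long withoutFlag = 0;
        for (String line : lines) {
            Matcher m = PREVIEW.matcher(line);
            if (!m.find()) {
                withoutFlag++;
            } else if (Boolean.parseBoolean(m.group(1))) {
                preview++;
            } else {
                nonPreview++;
            }
        }
        return new long[] {preview, nonPreview, withoutFlag};
    }

    public static void main(String[] args) {
        // Assumed line shape for illustration only; not copied from a real export.
        List<String> sample = Arrays.asList(
                "{\"preview\":true,\"result\":{\"dvc\":\"HOST-A\",\"event_id\":\"100\"}}",
                "{\"preview\":false,\"result\":{\"dvc\":\"HOST-A\",\"event_id\":\"100\"}}",
                "{\"preview\":false,\"result\":{\"dvc\":\"HOST-B\",\"event_id\":\"101\"}}");
        long[] counts = count(sample);
        System.out.println("preview=" + counts[0]
                + " nonPreview=" + counts[1]
                + " withoutFlag=" + counts[2]);
    }
}
```

Comparing the non-preview count against the expected total, and then running the non-preview records through a duplicate check on dvc + event_id, separates "previews mixed into the stream" from "final records delivered twice".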
Great - once you have the repro, ping me. My email is my username here @splunk.com 🙂